# Feature Engineering in a Pipeline

## Introduction
Feature engineering is the process of transforming a raw dataset into a form that a machine learning model can interpret more easily. If we maintain separate transformation code for training and prediction, we duplicate work and the code becomes harder to maintain: a change in one pipeline forces a matching change in the other.

A common practice when productionizing machine learning models is therefore to write a single transformation pipeline, so that the same data transformation code serves both training and prediction.

In this article, we discuss how to build a feature engineering pipeline with scikit-learn. Let’s first look at a few common transformations for numeric and categorical features.
## Transforming Numerical Features
One thing I really like about scikit-learn is that the same `fit`/`predict` pattern used for models also applies to data preprocessing; for a preprocessor, the two methods are called `fit` and `transform`.

We can use `SimpleImputer` to fill in missing values and `StandardScaler` to standardize values by removing the mean and scaling to unit variance.
```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
```
Let’s create a simple example.
```python
data = {'n1': [20, 300, 400, None, 100],
        'n2': [0.1, None, 0.5, 0.6, None],
        'n3': [-20, -10, 0, -30, None],
        }
df = pd.DataFrame(data)
df
```
|   | n1    | n2  | n3    |
|---|-------|-----|-------|
| 0 | 20.0  | 0.1 | -20.0 |
| 1 | 300.0 | NaN | -10.0 |
| 2 | 400.0 | 0.5 | 0.0   |
| 3 | NaN   | 0.6 | -30.0 |
| 4 | 100.0 | NaN | NaN   |
We can have a look at the mean of each column using the `.mean()` method.
```python
df.mean()
```
```
n1    205.0
n2      0.4
n3    -15.0
dtype: float64
```
Here we create a `SimpleImputer` object with `strategy="mean"`, which means missing values are filled with the mean of each column.
```python
num_imputer = SimpleImputer(strategy="mean")
```
We first fit our imputer `num_imputer` on our simple dataset.
```python
num_imputer.fit(df)
```
```
SimpleImputer()
```
After fitting, the statistics, i.e., the fill value for each column, are stored within the imputer `num_imputer`.
```python
num_imputer.statistics_
```
```
array([205. ,   0.4, -15. ])
```
Now we can fill the missing values in our original dataset with the `transform` method. We can also apply `fit` and `transform` in one go with the `fit_transform` method.
```python
imputed_features = num_imputer.transform(df)
imputed_features
```
```
array([[ 2.00e+01,  1.00e-01, -2.00e+01],
       [ 3.00e+02,  4.00e-01, -1.00e+01],
       [ 4.00e+02,  5.00e-01,  0.00e+00],
       [ 2.05e+02,  6.00e-01, -3.00e+01],
       [ 1.00e+02,  4.00e-01, -1.50e+01]])
```
```python
type(imputed_features)
```
```
numpy.ndarray
```
The transformed features are stored as a `numpy.ndarray`. We can convert the result back to a `pandas.DataFrame` with:
```python
imputed_df = pd.DataFrame(imputed_features,
                          index=df.index, columns=df.columns)
imputed_df
```
|   | n1    | n2  | n3    |
|---|-------|-----|-------|
| 0 | 20.0  | 0.1 | -20.0 |
| 1 | 300.0 | 0.4 | -10.0 |
| 2 | 400.0 | 0.5 | 0.0   |
| 3 | 205.0 | 0.6 | -30.0 |
| 4 | 100.0 | 0.4 | -15.0 |
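As mentioned above, `fit_transform` combines the two steps. Here is a minimal sketch checking that it matches calling `fit` and then `transform` separately:

```python
import numpy as np

# fit on df and transform df in a single call
imputed_in_one_step = SimpleImputer(strategy="mean").fit_transform(df)

# identical to the fit-then-transform result from above
assert np.allclose(imputed_in_one_step, imputed_features)
```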
The cool thing is that we can now use the same statistics saved in `num_imputer` to transform other datasets. For example, here we create a new dataset with only one row.
```python
# New data
data_new = {'n1': [None],
            'n2': [0.1],
            'n3': [None],
            }
df_new = pd.DataFrame(data_new)
df_new
```
|   | n1   | n2  | n3   |
|---|------|-----|------|
| 0 | None | 0.1 | None |
We can apply `num_imputer.transform` to this new dataset to fill in its missing values.
```python
pd.DataFrame(num_imputer.transform(df_new),
             index=df_new.index, columns=df_new.columns)
```

|   | n1    | n2  | n3    |
|---|-------|-----|-------|
| 0 | 205.0 | 0.1 | -15.0 |
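Note that the fill values come from the statistics learned on the original dataset, not from `df_new` itself. A one-line check (reusing `num_imputer` from above):

```python
# the NaN in 'n1' is filled with the mean learned during fit (205.0),
# not with anything computed from df_new
assert num_imputer.transform(df_new)[0, 0] == num_imputer.statistics_[0]
```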
`StandardScaler` works in a similar way. Here we scale the dataset after the imputer step.
```python
num_scaler = StandardScaler()
num_scaler.fit(imputed_df)
```
```
StandardScaler()
```
```python
pd.DataFrame(num_scaler.transform(imputed_df),
             index=df.index, columns=df.columns)
```
|   | n1        | n2            | n3   |
|---|-----------|---------------|------|
| 0 | -1.361620 | -1.792843e+00 | -0.5 |
| 1 | 0.699210  | -3.317426e-16 | 0.5  |
| 2 | 1.435221  | 5.976143e-01  | 1.5  |
| 3 | 0.000000  | 1.195229e+00  | -1.5 |
| 4 | -0.772811 | -3.317426e-16 | 0.0  |
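Like the imputer, the fitted scaler stores the statistics it learned, in its `mean_` and `scale_` attributes; a quick look:

```python
# per-column mean and standard deviation learned during fit
num_scaler.mean_   # array([205. ,   0.4, -15. ])
num_scaler.scale_  # per-column standard deviation used for scaling
```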
## Transforming Categorical Features
`OneHotEncoder` is commonly used to transform categorical features. Essentially, for each unique value in the original categorical column, a new column is created to represent that value. Each new column is filled with ones (the row has this value) and zeros (it doesn’t).
```python
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder(handle_unknown='ignore')

data = {'c1': ['Male', 'Female', 'Male', 'Female', 'Female'],
        'c2': ['Apple', 'Orange', 'Apple', 'Banana', 'Pear'],
        }
df = pd.DataFrame(data)
df
```
|   | c1     | c2     |
|---|--------|--------|
| 0 | Male   | Apple  |
| 1 | Female | Orange |
| 2 | Male   | Apple  |
| 3 | Female | Banana |
| 4 | Female | Pear   |
Let’s first `fit` a one-hot encoder to a dataset.
```python
cat_encoder.fit(df)
```
```
OneHotEncoder(handle_unknown='ignore')
```
Note that the categories of each column are stored in the attribute `categories_`.
```python
cat_encoder.categories_
```
```
[array(['Female', 'Male'], dtype=object),
 array(['Apple', 'Banana', 'Orange', 'Pear'], dtype=object)]
```
Here is the encoded dataset.
```python
pd.DataFrame(cat_encoder.transform(df).toarray(),
             index=df.index, columns=cat_encoder.get_feature_names_out())
```
|   | c1_Female | c1_Male | c2_Apple | c2_Banana | c2_Orange | c2_Pear |
|---|-----------|---------|----------|-----------|-----------|---------|
| 0 | 0.0       | 1.0     | 1.0      | 0.0       | 0.0       | 0.0     |
| 1 | 1.0       | 0.0     | 0.0      | 0.0       | 1.0       | 0.0     |
| 2 | 0.0       | 1.0     | 1.0      | 0.0       | 0.0       | 0.0     |
| 3 | 1.0       | 0.0     | 0.0      | 1.0       | 0.0       | 0.0     |
| 4 | 1.0       | 0.0     | 0.0      | 0.0       | 0.0       | 1.0     |
We can now use `cat_encoder` to `transform` a new dataset.
```python
data_new = {'c1': ['Female'], 'c2': ['Orange']}
df_new = pd.DataFrame(data_new)
df_new
```
|   | c1     | c2     |
|---|--------|--------|
| 0 | Female | Orange |
```python
pd.DataFrame(cat_encoder.transform(df_new).toarray(),
             index=df_new.index, columns=cat_encoder.get_feature_names_out())
```
|   | c1_Female | c1_Male | c2_Apple | c2_Banana | c2_Orange | c2_Pear |
|---|-----------|---------|----------|-----------|-----------|---------|
| 0 | 1.0       | 0.0     | 0.0      | 0.0       | 1.0       | 0.0     |
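Because the encoder was created with `handle_unknown='ignore'`, a category never seen during fitting is encoded as all zeros instead of raising an error. A small sketch (the value `'Mango'` is made up for illustration):

```python
# 'Mango' was not seen during fit, so all four c2 columns are zero
df_unseen = pd.DataFrame({'c1': ['Male'], 'c2': ['Mango']})
cat_encoder.transform(df_unseen).toarray()
# array([[0., 1., 0., 0., 0., 0.]])
```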
## Building a Feature Engineering Pipeline

### Make a Pipeline
For numerical features, we can use `make_pipeline` to first fill missing values with the median and then apply a standard scaler; for categorical features, we can make a pipeline to first fill missing values with the constant `"missing"` and then apply a one-hot encoder.
```python
from sklearn.pipeline import make_pipeline

numeric_transformer = make_pipeline(SimpleImputer(strategy="median"),
                                    StandardScaler())
categorical_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)
```
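`make_pipeline` names each step automatically after its class (as the fitted representation further below shows). If we prefer explicit step names, an equivalent construction uses `Pipeline` directly; the names here are our own choice:

```python
from sklearn.pipeline import Pipeline

# equivalent to the numeric_transformer above, with explicit step names
numeric_transformer_named = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
```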
The transformer pipelines can be used the same way as the individual transformers, i.e., we can `fit` a pipeline on some data and use it to `transform` new data. For example,
```python
data = {'n1': [20, 300, 400, None, 100],
        'n2': [0.1, None, 0.5, 0.6, None],
        'n3': [-20, -10, 0, -30, None],
        }
df = pd.DataFrame(data)
df
```
|   | n1    | n2  | n3    |
|---|-------|-----|-------|
| 0 | 20.0  | 0.1 | -20.0 |
| 1 | 300.0 | NaN | -10.0 |
| 2 | 400.0 | 0.5 | 0.0   |
| 3 | NaN   | 0.6 | -30.0 |
| 4 | 100.0 | NaN | NaN   |
```python
numeric_transformer.fit(df)
```
```
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
                ('standardscaler', StandardScaler())])
```
The result has the same shape as in the earlier example, where we applied the imputer and then the scaler separately; the values differ slightly because this pipeline imputes with the median rather than the mean.
```python
pd.DataFrame(numeric_transformer.transform(df),
             index=df.index, columns=df.columns)
```

|   | n1        | n2        | n3   |
|---|-----------|-----------|------|
| 0 | -1.354113 | -1.950034 | -0.5 |
| 1 | 0.706494  | 0.344124  | 0.5  |
| 2 | 1.442425  | 0.344124  | 1.5  |
| 3 | -0.029437 | 0.917663  | -1.5 |
| 4 | -0.765368 | 0.344124  | 0.0  |
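The fitted pipeline also exposes its individual steps, so we can still inspect what each one learned; for example, via the `named_steps` attribute (the step names are the lowercased class names shown in the pipeline representation above):

```python
# median fill values learned by the imputer step of the pipeline
numeric_transformer.named_steps['simpleimputer'].statistics_
# array([200. ,   0.5, -15. ])
```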
### Compose a Column Transformer
For a real-life dataset we may have both numeric and categorical features, and it would be nice to selectively apply the numeric transformation to the numeric columns and the categorical transformation to the categorical columns. We can accomplish this by composing a `ColumnTransformer`.
The example below has columns with numeric values (`'n1'`, `'n2'`, `'n3'`) and categorical values (`'c1'`, `'c2'`).
```python
data = {'n1': [20, 300, 400, None, 100],
        'n2': [0.1, None, 0.5, 0.6, None],
        'n3': [-20, -10, 0, -30, None],
        'c1': ['Male', 'Female', None, 'Female', 'Female'],
        'c2': ['Apple', 'Orange', 'Apple', 'Banana', 'Pear'],
        }
df = pd.DataFrame(data)
df
```
|   | n1    | n2  | n3    | c1     | c2     |
|---|-------|-----|-------|--------|--------|
| 0 | 20.0  | 0.1 | -20.0 | Male   | Apple  |
| 1 | 300.0 | NaN | -10.0 | Female | Orange |
| 2 | 400.0 | 0.5 | 0.0   | None   | Apple  |
| 3 | NaN   | 0.6 | -30.0 | Female | Banana |
| 4 | 100.0 | NaN | NaN   | Female | Pear   |
A `ColumnTransformer` stores a list of `(name, transformer, columns)` tuples in its `transformers` parameter, which allows different columns to be transformed separately.
```python
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, ["n1", "n2", "n3"]),
        ("cat", categorical_transformer, ["c1", "c2"]),
    ]
)
```
With the `fit_transform` method, we fit all transformers on the dataset `df`, transform `df`, and concatenate the results.
```python
preprocessor.fit_transform(df)
```
```
array([[-1.35411306, -1.95003374, -0.5       ,  0.        ,  1.        ,
         0.        ,  1.        ,  0.        ,  0.        ,  0.        ],
       [ 0.70649377,  0.3441236 ,  0.5       ,  1.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  1.        ,  0.        ],
       [ 1.44242478,  0.3441236 ,  1.5       ,  0.        ,  0.        ,
         1.        ,  1.        ,  0.        ,  0.        ,  0.        ],
       [-0.02943724,  0.91766294, -1.5       ,  1.        ,  0.        ,
         0.        ,  0.        ,  1.        ,  0.        ,  0.        ],
       [-0.76536825,  0.3441236 ,  0.        ,  1.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  1.        ]])
```
After fitting the transformers, we can use `preprocessor` on a new dataset.
```python
data_new = {'n1': [10],
            'n2': [None],
            'n3': [-10],
            'c1': ['Male'],
            'c2': [None],
            }
df_new = pd.DataFrame(data_new)
df_new
```
|   | n1 | n2   | n3  | c1   | c2   |
|---|----|------|-----|------|------|
| 0 | 10 | None | -10 | Male | None |
```python
preprocessor.transform(df_new)
```
```
array([[-1.42770616,  0.3441236 ,  0.5       ,  0.        ,  1.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])
```
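To see which output column corresponds to which feature, we can ask the fitted column transformer for its output feature names. This relies on `get_feature_names_out`, which recent scikit-learn versions (roughly 1.0 and later) also implement for `ColumnTransformer`:

```python
# output columns are prefixed with the transformer name, e.g. 'num__n1'
preprocessor.get_feature_names_out()
```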
### Design Your Own Transformers
We can design custom transformers by subclassing `BaseEstimator` and `TransformerMixin`. There are three methods we need to implement: `__init__`, `fit`, and `transform`.
In the example below, we design a simple transformer that first fills missing values with zeros and then divides the values by 10.
```python
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self) -> None:
        pass

    def fit(self, X: pd.DataFrame, y=None):
        # nothing to learn from the data; return self for chaining
        return self

    def transform(self, X: pd.DataFrame, y=None):
        # fill missing values with zeros, then divide by 10
        X = X.fillna(0)
        return X / 10
```
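Used on its own, the custom transformer behaves like any other transformer; a quick sketch applying it to the `'n3'` column of `df`:

```python
# fit learns nothing here; transform fills NaN with 0 and divides by 10
CustomTransformer().fit_transform(df[['n3']])
#      n3
# 0  -2.0
# 1  -1.0
# 2   0.0
# 3  -3.0
# 4   0.0
```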
Once initialized, the custom transformer can also be used the same way as any of the other transformers we discussed before. Here we use it on column `"n3"` inside a `ColumnTransformer`.
```python
custom_transformer = CustomTransformer()

preprocessor_custom = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, ["n1", "n2"]),
        ("custom", custom_transformer, ["n3"]),
        ("cat", categorical_transformer, ["c1", "c2"]),
    ]
)
```
```python
preprocessor_custom.fit_transform(df)
```
```
array([[-1.35411306, -1.95003374, -2.        ,  0.        ,  1.        ,
         0.        ,  1.        ,  0.        ,  0.        ,  0.        ],
       [ 0.70649377,  0.3441236 , -1.        ,  1.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  1.        ,  0.        ],
       [ 1.44242478,  0.3441236 ,  0.        ,  0.        ,  0.        ,
         1.        ,  1.        ,  0.        ,  0.        ,  0.        ],
       [-0.02943724,  0.91766294, -3.        ,  1.        ,  0.        ,
         0.        ,  0.        ,  1.        ,  0.        ,  0.        ],
       [-0.76536825,  0.3441236 ,  0.        ,  1.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  1.        ]])
```
## Conclusion
In summary, we discussed how data transformation can be constructed as a pipeline. We can fit a data transformation pipeline on our training dataset and then use the same pipeline to transform new datasets.
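As a closing sketch, the preprocessor can be chained with an estimator so that training and prediction share exactly the same transformation code; the estimator choice and the target `y` below are made up for illustration:

```python
from sklearn.linear_model import LogisticRegression

# hypothetical labels for the five rows of df, for illustration only
y = [0, 1, 0, 1, 1]

# preprocessing and model combined in one pipeline
model = make_pipeline(preprocessor, LogisticRegression())
model.fit(df, y)

# new data runs through the identical transformations before prediction
model.predict(df_new)
```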