Using a pipeline to preprocess your data offers some substantial advantages. A pipeline guarantees that no information from the test set is used in preprocessing or training the model. Pipelines are often combined with cross-validation to find the best parameter combination of a machine learning algorithm. However, the preprocessing steps themselves, for example whether to scale the data, as well as the choice of machine learning algorithm, can also be seen as hyperparameters: not of a single model, but of the whole training process. We can therefore tune them as such to further improve our model's performance. In this post, I will show you how to do it with scikit-learn!
We start with the required packages:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import f1_score, classification_report
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
We are again working with the Titanic data set.
titanic = pd.read_csv('./titanic.csv')
titanic.head()
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | Mahon Miss. Bridget Delia | female | NaN | 0 | 0 | 330924 | 7.8792 | NaN | Q | NaN | NaN | NaN |
| 1 | 1 | 0 | Clifford Mr. George Quincy | male | NaN | 0 | 0 | 110465 | 52.0000 | A14 | S | NaN | NaN | Stoughton MA |
| 2 | 3 | 0 | Yasbeck Mr. Antoni | male | 27.0 | 1 | 0 | 2659 | 14.4542 | NaN | C | C | NaN | NaN |
| 3 | 3 | 1 | Tenglin Mr. Gunnar Isidor | male | 25.0 | 0 | 0 | 350033 | 7.7958 | NaN | S | 13 15 | NaN | NaN |
| 4 | 3 | 0 | Kelly Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | NaN | 70.0 | NaN |
Since we will use the test data (in cross-validation) to make model-relevant decisions, such as which preprocessing steps to perform, we need fresh, yet unseen data to obtain a valid estimate of our final model’s out-of-sample performance. This is the same reason why we perform cross-validation in the first place! Nested cross-validation is an option here (a short sketch follows the split below), but I stick to creating a final hold-out set:
X = titanic.drop('survived', axis = 1)
y = titanic.survived
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, stratify = y, test_size = 0.2, random_state = 42)
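For completeness, here is a minimal sketch of what nested cross-validation could look like. It assumes the pipeline and params objects that we only build further below, so treat it as an illustration rather than as part of this post's workflow:
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
# Inner loop: tune the pipeline's hyperparameters.
# Outer loop: estimate the out-of-sample performance of the whole tuning procedure.
inner_cv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
outer_cv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
nested_scores = cross_val_score(
    GridSearchCV(pipeline, params, cv = inner_cv, scoring = 'f1', n_jobs = -1),
    X, y, cv = outer_cv, scoring = 'f1'
)
print(f'Nested CV F1-score: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}')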
Following the last post, we create a pipeline including a ColumnTransformer (‘preprocessor’) that imputes the missing values, creates dummy variables for the categorical features and scales the numeric features.
categorical_features = ['pclass', 'sex', 'embarked']
categorical_transformer = Pipeline(
[
('imputer_cat', SimpleImputer(strategy = 'constant', fill_value = 'missing')),
('onehot', OneHotEncoder(handle_unknown = 'ignore'))
]
)
numeric_features = ['age', 'sibsp', 'parch', 'fare']
numeric_transformer = Pipeline(
[
('imputer_num', SimpleImputer()),
('scaler', StandardScaler())
]
)
preprocessor = ColumnTransformer(
[
('categoricals', categorical_transformer, categorical_features),
('numericals', numeric_transformer, numeric_features)
],
remainder = 'drop'
)
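Before plugging the preprocessor into a pipeline, you can sanity-check it on its own. A quick sketch, assuming the X_train created above:
# Fit the ColumnTransformer on the training data and inspect the result:
# the columns are the one-hot encoded categoricals plus the four scaled numericals.
transformed = preprocessor.fit_transform(X_train)
print(transformed.shape)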
In the end, we include this preprocessor in our pipeline.
pipeline = Pipeline(
[
('preprocessing', preprocessor),
('clf', LogisticRegression())
]
)
Tuning the machine learning algorithm
In the same way we provide a list of hyperparameters of a machine learning algorithm in a parameter grid to find the best parameter combination, we can also treat the machine learning algorithm itself as a “hyperparameter”. ('clf', LogisticRegression()) above is simply a placeholder into which other machine learning algorithms can be filled. In the grid below, I first try a logistic regression and then a random forest classifier. Note that the parameters need to be a list of dictionaries because the two models have different hyperparameters to tune.
params = [
{
'clf': [LogisticRegression()],
'clf__solver': ['liblinear'],
'clf__penalty': ['l1', 'l2'],
'clf__C': [0.01, 0.1, 1, 10, 100],
'clf__random_state': [42],
},
{
'clf': [RandomForestClassifier()],
'clf__n_estimators': [5, 50, 100, 250],
'clf__max_depth': [5, 8, 10],
'clf__random_state': [42],
}
]
Tuning the preprocessing steps
Next, we take care of tuning the preprocessing steps. We add them as parameters to the parameter grid by using the names given in the pipeline above: the StandardScaler() used to preprocess the numericals can be addressed by 'preprocessing__numericals__scaler'. 'preprocessing' addresses the pipeline step, which is our ColumnTransformer, '__numericals' addresses the pipeline for the numeric features inside this ColumnTransformer, and '__scaler' addresses the StandardScaler in this particular pipeline. We could modify the StandardScaler here, for example by setting 'preprocessing__numericals__scaler__with_std': [False], but we can also decide whether standardizing is performed at all. By passing the list [StandardScaler(), 'passthrough'] to the 'scaler' step, we either use the StandardScaler() in this step or no transformer at all (with 'passthrough'). This way, we can evaluate how our model performance changes if we do not standardize at all! The same is true for the imputer: we can try out whether the mean or the median delivers better performance in this particular cross-validation process.
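If you are ever unsure how a nested step has to be addressed, you do not have to guess: the pipeline can list all of its tunable parameter names itself. A quick way to check, using the pipeline defined above:
# Print all parameter names that belong to the numeric preprocessing branch
for name in pipeline.get_params().keys():
    if 'numericals' in name:
        print(name)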
Below you find the complete parameter grid with all mentioned parameters included:
params = [
{
'clf': [LogisticRegression()],
'clf__solver': ['liblinear'],
'clf__penalty': ['l1', 'l2'],
'clf__C': [0.01, 0.1, 1, 10, 100],
'clf__random_state': [42],
'preprocessing__numericals__scaler': [StandardScaler(), 'passthrough'],
'preprocessing__numericals__imputer_num__strategy': ['mean', 'median']
},
{
'clf': [RandomForestClassifier()],
'clf__n_estimators': [5, 50, 100, 250],
'clf__max_depth': [5, 8, 10],
'clf__random_state': [42],
'preprocessing__numericals__scaler': [StandardScaler(), 'passthrough'],
'preprocessing__numericals__imputer_num__strategy': ['mean', 'median']
}
]
One last thing: if you wish to modify the StandardScaler(), e.g. by setting with_mean, you need to do so at the point where you declare what to fill into the 'scaler' step. Here, this would be 'preprocessing__numericals__scaler': [StandardScaler(with_mean = False), 'passthrough'].
Let’s see which preprocessing steps and which machine learning algorithm perform best:
rskf = RepeatedStratifiedKFold(n_splits = 5, n_repeats = 2, random_state = 42)
cv = GridSearchCV(pipeline, params, cv = rskf, scoring = ['f1', 'accuracy'], refit = 'f1', n_jobs = -1)
cv.fit(X_train, y_train)
print(f'Best F1-score: {cv.best_score_:.3f}\n')
print(f'Best parameter set: {cv.best_params_}\n')
print(f'Scores: {classification_report(y_train, cv.predict(X_train))}')
Best F1-score: 0.722
Best parameter set: {'clf': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=8, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50,
n_jobs=None, oob_score=False, random_state=42, verbose=0,
warm_start=False), 'clf__max_depth': 8, 'clf__n_estimators': 50, 'clf__random_state': 42, 'preprocessing__numericals__imputer_num__strategy': 'median', 'preprocessing__numericals__scaler': StandardScaler(copy=True, with_mean=True, with_std=True)}
Scores:               precision    recall  f1-score   support

           0       0.87      0.95      0.91       647
           1       0.91      0.77      0.83       400

    accuracy                           0.88      1047
   macro avg       0.89      0.86      0.87      1047
weighted avg       0.88      0.88      0.88      1047
Our best estimator is a random forest with max_depth = 8, n_estimators = 50, imputation by the median, and standardized numericals.
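cv.best_params_ only shows the winning combination. To see how the scaling and imputation choices fared across all candidates, cv.cv_results_ contains the cross-validated scores of every parameter combination. A short sketch of how one could inspect it (the column names follow from the parameter grid and the scoring setup above):
# Collect the full grid-search results and compare the preprocessing choices
# by their mean cross-validated F1-score.
results = pd.DataFrame(cv.cv_results_)
cols = [
    'param_preprocessing__numericals__scaler',
    'param_preprocessing__numericals__imputer_num__strategy',
    'mean_test_f1',
    'mean_test_accuracy'
]
print(results[cols].sort_values('mean_test_f1', ascending = False).head(10))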
How do we do on completely new, yet unseen data?
preds = cv.predict(X_holdout)
print(f'Scores: {classification_report(y_holdout, preds)}\n')
print(f'F1-score: {f1_score(y_holdout, preds):.3f}')
Scores:               precision    recall  f1-score   support

           0       0.83      0.88      0.86       162
           1       0.79      0.71      0.75       100

    accuracy                           0.82       262
   macro avg       0.81      0.80      0.80       262
weighted avg       0.82      0.82      0.81       262
F1-score: 0.747
There seems to be some room for improvement!
Find the complete code in a single file here: