How to output Pandas object from sklearn pipeline

I have constructed a pipeline that takes a pandas DataFrame whose columns have been split into categorical and numerical features. I am trying to run GridSearchCV on it and ultimately look at the ranked feature importances of the best-performing model that GridSearchCV selects. The problem is that sklearn pipelines output NumPy arrays and lose all column information along the way. So when I examine the most important coefficients of the model, I am left with an unlabeled NumPy array.

I have read that building a custom transformer might be a solution, but I have no experience doing so myself. I have also looked into the sklearn-pandas package, but I am hesitant to adopt something that might not be updated in parallel with sklearn. Can anyone suggest the best way around this issue? I am also open to any literature with hands-on applications of pandas and sklearn pipelines.

My Pipeline:

# impute and standardize numeric data 
numeric_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="mean")),
    ('scale', StandardScaler())
])

# impute and encode dummy variables for categorical data
categorical_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

clf = Pipeline([
    ('transform', preprocessor),
    ('ridge', Ridge())
])

Cross Validation:

kf = KFold(n_splits=4, shuffle=True, random_state=44)

cross_val_score(clf, X_train, y_train, cv=kf).mean()

Grid Search:

param_grid = {
    'ridge__alpha': [.001, .1, 1.0, 5, 10, 100]
}

gs = GridSearchCV(clf, param_grid, cv = kf)
gs.fit(X_train, y_train)

Examining Coefficients:

model = gs.best_estimator_
predictions = model.fit(X_train, y_train).predict(X_test)
model.named_steps['ridge'].coef_

Here are the model coefficients as they currently come out when the pipeline is run on the seaborn "mpg" dataset:

array([-4.64782052e-01,  1.47805207e+00, -3.28948689e-01, -5.37033173e+00,
        2.80000700e-01,  2.71523808e+00,  6.29170887e-01,  9.51627968e-01,
       ...
       -1.50574860e+00,  1.88477450e+00,  4.57285471e+00, -6.90459868e-01,
        5.49416409e+00])

Ideally I would like to preserve the pandas DataFrame information and retrieve the derived column names after OneHotEncoder and the other transformers have been applied.

I would actually create the column names from the input. If your input is already divided into numerical and categorical columns, you can use pd.get_dummies to list the distinct categories of each categorical feature.

Then you can build proper names for the columns, as shown in the last part of this working example, which is based on the question and uses some artificial data.

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV

# create artificial data
numeric_features_vals = pd.DataFrame({'x1': [1, 2, 3, 4], 'x2': [0.15, 0.25, 0.5, 0.45]})
numeric_features = ['x1', 'x2']
categorical_features_vals = pd.DataFrame({'cat1': [0, 1, 1, 2], 'cat2': [2, 1, 5, 0] })
categorical_features = ['cat1', 'cat2']

X_train = pd.concat([numeric_features_vals, categorical_features_vals], axis=1)
X_test = pd.DataFrame({'x1':[2,3], 'x2':[0.2, 0.3], 'cat1':[0, 1], 'cat2':[2, 1]})
y_train = pd.DataFrame({'labels': [10, 20, 30, 40]})

# impute and standardize numeric data 
numeric_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="mean")),
    ('scale', StandardScaler())
])

# impute and encode dummy variables for categorical data
categorical_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    # on scikit-learn >= 1.2, pass sparse_output=False instead of sparse=False
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

clf = Pipeline([
    ('transform', preprocessor),
    ('ridge', Ridge())
])


kf = KFold(n_splits=2, shuffle=True, random_state=44)
cross_val_score(clf, X_train, y_train, cv=kf).mean()

param_grid = {
    'ridge__alpha': [.001, .1, 1.0, 5, 10, 100]
}

gs = GridSearchCV(clf, param_grid, cv = kf)
gs.fit(X_train, y_train)

model = gs.best_estimator_
predictions = model.fit(X_train, y_train).predict(X_test)
print('coefficients : ',  model.named_steps['ridge'].coef_, '\n')

# create column names for the one-hot encoded categorical data
columns_names_to_map = list(numeric_features)
columns_names_to_map.extend('cat1_' + str(col) for col in pd.get_dummies(X_train['cat1']).columns)
columns_names_to_map.extend('cat2_' + str(col) for col in pd.get_dummies(X_train['cat2']).columns)

print('columns after preprocessing :', columns_names_to_map, '\n')
print('#' * 80)
rescaled = preprocessor.fit_transform(X_train)
print('\n', 'dataframe of rescaled features with custom column names: \n\n',
      pd.DataFrame(rescaled, columns=columns_names_to_map))
print('#' * 80)
# coef_ has shape (1, n_features) here because y_train is a one-column dataframe
print('\n', 'dataframe of ridge coefficients with custom column names: \n\n',
      pd.DataFrame(np.atleast_2d(model.named_steps['ridge'].coef_), columns=columns_names_to_map))

The last print statement in the code above outputs a dataframe that maps each parameter name to its coefficient value.
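A possible variation on the naming step: instead of re-deriving the categories with pd.get_dummies, the names can be read from the fitted encoder's categories_ attribute, so they always match what the encoder actually learned. The block below is only a sketch on the same kind of artificial data; the variable names fitted_encoder, column_names and coefficients are illustrative and not part of the answer above. The encoder is left at its default output type so that no version-specific argument is needed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Artificial data, mirroring the example above.
X_train = pd.DataFrame({
    'x1': [1, 2, 3, 4],
    'x2': [0.15, 0.25, 0.5, 0.45],
    'cat1': [0, 1, 1, 2],
    'cat2': [2, 1, 5, 0],
})
y_train = pd.Series([10, 20, 30, 40], name='labels')
numeric_features = ['x1', 'x2']
categorical_features = ['cat1', 'cat2']

numeric_transformer = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])
categorical_transformer = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one_hot', OneHotEncoder(handle_unknown='ignore')),
])
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
])
clf = Pipeline([('transform', preprocessor), ('ridge', Ridge())])
clf.fit(X_train, y_train)

# Reach into the fitted pipeline to get the fitted OneHotEncoder.
fitted_encoder = (clf.named_steps['transform']
                  .named_transformers_['cat']
                  .named_steps['one_hot'])

# Numeric columns keep their names; each categorical column expands
# into one name per category learned by the encoder.
column_names = list(numeric_features)
for feature, categories in zip(categorical_features, fitted_encoder.categories_):
    column_names.extend(f'{feature}_{category}' for category in categories)

# y_train is 1-D here, so coef_ is a flat array with one value per column.
coefficients = pd.Series(clf.named_steps['ridge'].coef_, index=column_names)
print(coefficients)
```

The order of the names follows the order of the transformers in the ColumnTransformer ('num' first, then 'cat'), which is the same assumption the get_dummies version relies on.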

pip install sklearn-pandas-transformers

from sklearn_pandas_transformers.transformers import SklearnPandasWrapper

column_transformer = ColumnTransformer(transformers=[
    ('numeric_transformer', SklearnPandasWrapper(numeric_transformer), ['num']),
    ('categorical_transformer', SklearnPandasWrapper(categorical_transformer), ['cat']),
])

I would use model.named_steps['transform'].get_feature_names_out().

It will return the feature names like this:

array(['num__cylinders', 'num__displacement', 'num__horsepower',
       'num__weight', 'num__acceleration', 'num__model_year',
       'cat__origin_europe', 'cat__origin_japan', 'cat__origin_usa',...])

Then you can use the feature names to turn the coefficients into a dataframe:

weights_df = pd.DataFrame(
    model.named_steps['ridge'].coef_,
    index=model.named_steps['transform'].get_feature_names_out(),
).T
