
How to track categorical indices for catboost with sklearn pipeline

I want to track the categorical feature indices within an sklearn pipeline, in order to supply them to CatBoostClassifier.

I am starting with a set of categorical features that is known before the pipeline's fit(). The pipeline itself changes the structure of the data and removes features in the feature selection step.

How can I know upfront which categorical features will be removed or added by the pipeline? I need to know the updated list of indices at the time I call the fit() method. The problem is that my dataset may change after the transformations.

Here is an example of my dataframe:

import numpy as np
import pandas as pd

data = pd.DataFrame({'pet':      ['cat', 'dog', 'dog', 'fish', np.nan, 'dog', 'cat', 'fish'],
                     'children': [4., 6, 3, np.nan, 2, 3, 5, 4],
                     'salary':   [90., 24, np.nan, 27, 32, 59, 36, 27],
                     'gender':   ['male', 'male', 'male', 'male', 'male', 'male', 'male', 'male'],
                     'happy':    [0, 1, 1, 0, 1, 1, 0, 0]})

categorical_features = ['pet', 'gender']
numerical_features = ['children', 'salary']
target = 'happy'

print(data)

     pet    children    salary  gender  happy
0    cat    4.0         90.0    male    0
1    dog    6.0         24.0    male    1
2    dog    3.0         NaN     male    1
3    fish   NaN         27.0    male    0
4    NaN    2.0         32.0    male    1
5    dog    3.0         59.0    male    1
6    cat    5.0         36.0    male    0
7    fish   4.0         27.0    male    0

Now I want to run a pipeline with multiple steps. One of these steps is VarianceThreshold(), which in my case will cause "gender" to be removed from the dataframe.

X, y = data.drop(columns=[target]), data[target]

from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from catboost import CatBoostClassifier

pipeline = Pipeline(steps=[
    (
        'preprocessing',
        ColumnTransformer(transformers=[
            (
                'categoricals',
                Pipeline(steps=[
                    ('fillna_with_frequent', SimpleImputer(strategy='most_frequent')),
                    ('ordinal_encoder', OrdinalEncoder())
                ]),
                categorical_features
            ),
            (
                'numericals',
                Pipeline(steps=[
                    ('fillna_with_mean', SimpleImputer(strategy='mean'))
                ]),
                numerical_features
            )
        ])
    ),
    (
        'feature_selection',
        VarianceThreshold()
    ),
    (
        'estimator',
        CatBoostClassifier()
    )
])

Now, when I try to get the list of categorical feature indices for CatBoost, I cannot tell that "gender" is no longer part of my dataframe.

cat_features = [data.columns.get_loc(col) for col in categorical_features]
print(cat_features)
[0, 3]

The indices 0, 3 are wrong because after VarianceThreshold, feature 3 (gender) will be removed.

pipeline.fit(X, y, estimator__cat_features=cat_features)
---------------------------------------------------------------------------
CatBoostError                             Traceback (most recent call last)
<ipython-input-230-527766a70b4d> in <module>
----> 1 pipeline.fit(X, y, estimator__cat_features=cat_features)

~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    265         Xt, fit_params = self._fit(X, y, **fit_params)
    266         if self._final_estimator is not None:
--> 267             self._final_estimator.fit(Xt, y, **fit_params)
    268         return self
    269 

~/anaconda3/lib/python3.7/site-packages/catboost/core.py in fit(self, X, y, cat_features, sample_weight, baseline, use_best_model, eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period, silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model)
   2801         self._fit(X, y, cat_features, None, sample_weight, None, None, None, None, baseline, use_best_model,
   2802                   eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period,
-> 2803                   silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model)
   2804         return self
   2805 

~/anaconda3/lib/python3.7/site-packages/catboost/core.py in _fit(self, X, y, cat_features, pairs, sample_weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, use_best_model, eval_set, verbose, logging_level, plot, column_description, verbose_eval, metric_period, silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model)
   1231         _check_train_params(params)
   1232 
-> 1233         train_pool = _build_train_pool(X, y, cat_features, pairs, sample_weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, column_description)
   1234         if train_pool.is_empty_:
   1235             raise CatBoostError("X is empty.")

~/anaconda3/lib/python3.7/site-packages/catboost/core.py in _build_train_pool(X, y, cat_features, pairs, sample_weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, column_description)
    689             raise CatBoostError("y has not initialized in fit(): X is not catboost.Pool object, y must be not None in fit().")
    690         train_pool = Pool(X, y, cat_features=cat_features, pairs=pairs, weight=sample_weight, group_id=group_id,
--> 691                           group_weight=group_weight, subgroup_id=subgroup_id, pairs_weight=pairs_weight, baseline=baseline)
    692     return train_pool
    693 

~/anaconda3/lib/python3.7/site-packages/catboost/core.py in __init__(self, data, label, cat_features, column_description, pairs, delimiter, has_header, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
    318                         )
    319 
--> 320                 self._init(data, label, cat_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names)
    321         super(Pool, self).__init__()
    322 

~/anaconda3/lib/python3.7/site-packages/catboost/core.py in _init(self, data, label, cat_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names)
    638             cat_features = _get_cat_features_indices(cat_features, feature_names)
    639             self._check_cf_type(cat_features)
--> 640             self._check_cf_value(cat_features, features_count)
    641         if pairs is not None:
    642             self._check_pairs_type(pairs)

~/anaconda3/lib/python3.7/site-packages/catboost/core.py in _check_cf_value(self, cat_features, features_count)
    360                 raise CatBoostError("Invalid cat_features[{}] = {} value type={}: must be int().".format(indx, feature, type(feature)))
    361             if feature >= features_count:
--> 362                 raise CatBoostError("Invalid cat_features[{}] = {} value: must be < {}.".format(indx, feature, features_count))
    363 
    364     def _check_pairs_type(self, pairs):

CatBoostError: Invalid cat_features[1] = 3 value: must be < 3.

I expect the cat_features to be [0], but the actual output is [0, 3].

You can try passing cat_features to the CatBoostClassifier init function.
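
For reference, a minimal sketch of that suggestion applied directly to the raw frame from the question (outside the pipeline, so CatBoost handles the string categories itself; the NaN in 'pet' is filled with a placeholder string first, because CatBoost does not accept NaN in categorical cells):

from catboost import CatBoostClassifier

X_raw = X.fillna({'pet': 'missing'})              # categorical cells must be strings/ints
model = CatBoostClassifier(cat_features=[0, 3],   # 'pet' and 'gender' in the raw column order
                           verbose=0)
model.fit(X_raw, y)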

The issue is not with catboost but with how your ColumnTransformer works. The ColumnTransformer reconstructs the input dataframe post-transformation, with the columns ordered according to the order of your transform operations.
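
A quick way to see that order, reusing the pipeline from the question: the 'categoricals' transformer is listed first, so the encoded pet and gender columns come out first, followed by the imputed children and salary.

preprocessing = pipeline.named_steps['preprocessing']
Xt = preprocessing.fit_transform(X)
print(Xt[:2])
# column order: pet, gender (encoded), then children, salary (imputed)
# with a recent scikit-learn you can also inspect the names:
# print(preprocessing.get_feature_names_out())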

The underlying problem here is that transformers do not follow a predefined output schema, which means you could transform 1 column into 3 (e.g. one-hot encoded categorical columns).
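
For example (a sketch): swapping the OrdinalEncoder for a OneHotEncoder would turn the single 'pet' column into one column per category, shifting every positional index downstream.

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse_output=False)   # use sparse=False on scikit-learn < 1.2
pet_encoded = enc.fit_transform(data[['pet']].fillna('missing'))
print(pet_encoded.shape)                   # (8, 4): one column each for cat, dog, fish, missing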

As such, you need to keep track of the number of features you're generating yourself.

My solution to this was to organize the Pipeline in such a way that I knew in advance which indices corresponded to the categorical columns at the last step (the Catboost estimator). Typically, I would isolate and wrap all the categorical-related operations within a single transformer (you can do sub-transformations inside it too) and keep track of how many columns it outputs. Crucially, I would set this transformer as the first transformer in the pipeline. This guarantees that the first X indices are categorical, and that list of indices can be passed to the catboost cat_features parameter at the end.
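
A sketch of that layout, reusing the names and imports from the question: the categorical block is the first transformer, so its outputs occupy the first positions and the indices for cat_features are simply 0 .. n_categoricals - 1. (This still assumes no later step drops one of those columns; if one can, the count has to be adjusted accordingly.)

preprocessor = ColumnTransformer(transformers=[
    ('categoricals', Pipeline(steps=[
        ('fillna_with_frequent', SimpleImputer(strategy='most_frequent')),
        ('ordinal_encoder', OrdinalEncoder()),
    ]), categorical_features),                             # first block -> first output columns
    ('numericals', SimpleImputer(strategy='mean'), numerical_features),
])

cat_features = list(range(len(categorical_features)))      # [0, 1]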

The reason you are getting an error is that your current cat_features are derived from the non-transformed dataset. To fix this, you have to derive cat_features after your dataset has been transformed. This is how I tracked mine: I fitted the transformer to the dataset, converted the transformed output to a pandas dataframe, and then retrieved the categorical indices:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# numerical_idx was not defined in the original snippet; for the question's data
# these are the positions of the numeric columns
numerical_idx = [X.columns.get_loc(col) for col in numerical_features]

column_transform = ColumnTransformer([('n', MinMaxScaler(), numerical_idx)], remainder='passthrough')
scaled_X = column_transform.fit_transform(X)
new_df = pd.DataFrame(scaled_X)
new_df = new_df.infer_objects()  # converts each column to its most accurate datatype
cat_features_new = [new_df.columns.get_loc(col) for col in new_df.select_dtypes(include=['object', 'bool']).columns]
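
With those indices, the model can then be fitted on the transformed frame (a sketch, reusing new_df, y and cat_features_new from above; CatBoost expects strings or ints in categorical cells, so NaNs in those columns are filled with a placeholder string first):

for idx in cat_features_new:
    col = new_df.columns[idx]
    new_df[col] = new_df[col].fillna('missing').astype(str)

model = CatBoostClassifier(verbose=0)
model.fit(new_df, y, cat_features=cat_features_new)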
