
How to use SimpleImputer class to impute missing values in different columns with different constant values?
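I am using sklearn.impute.SimpleImputer(strategy='constant', fill_value=0) to impute every column that has missing values with a single constant (0 in this case).
However, sometimes it makes sense to impute different constant values in different columns. For example, I might want to replace all NaN values in one column with that column's maximum value, replace the NaN values in another column with its minimum, or use the median/mean of a particular column.
How can I achieve this?
Also, I am new to this field, so I am not sure whether doing this would improve my model's results. Opinions are welcome.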
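If you want to impute different features with different arbitrary values, or with the median, you need to set up several SimpleImputer steps within a pipeline and then join them with the ColumnTransformer: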
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# X_train and X_test are assumed to be existing pandas DataFrames
# that contain the columns listed below

# first we need to make lists, indicating which features
# will be imputed with each method
features_numeric = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
features_categoric = ['BsmtQual', 'FireplaceQu']

# then we instantiate the imputers, within a pipeline
# we create one imputer for numerical and one imputer
# for categorical

# this imputer imputes with the mean
imputer_numeric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])

# this imputer imputes with an arbitrary value
imputer_categoric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
])

# then we put the features list and the transformers together
# using the column transformer
preprocessor = ColumnTransformer(transformers=[
    ('imputer_numeric', imputer_numeric, features_numeric),
    ('imputer_categoric', imputer_categoric, features_categoric),
])

# now we fit the preprocessor
preprocessor.fit(X_train)

# and now we can impute the data
# remember it returns a numpy array
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)
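To tie this back to the max/min case in the question, here is a minimal, self-contained sketch on toy data. The column names `a`, `b`, `c` and the values are made up for illustration. It assumes the fill constants should be computed from the training set only and then passed to `SimpleImputer(strategy='constant')`, one imputer per column:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data; the column names are hypothetical.
X_train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0],
    "b": [10.0, 20.0, np.nan, 40.0],
    "c": [2.0, np.nan, 4.0, 9.0],
})
X_test = pd.DataFrame({
    "a": [np.nan, 2.0],
    "b": [np.nan, 15.0],
    "c": [np.nan, 1.0],
})

# Compute the per-column constants on the training data only,
# so that nothing from the test set leaks into the imputation.
a_max = float(X_train["a"].max())
b_min = float(X_train["b"].min())

columns = ["a", "b", "c"]
preprocessor = ColumnTransformer(transformers=[
    # column "a": fill NaN with the training maximum
    ("fill_max", SimpleImputer(strategy="constant", fill_value=a_max), ["a"]),
    # column "b": fill NaN with the training minimum
    ("fill_min", SimpleImputer(strategy="constant", fill_value=b_min), ["b"]),
    # column "c": fill NaN with the training median
    ("fill_median", SimpleImputer(strategy="median"), ["c"]),
])

preprocessor.fit(X_train)

# transform() returns a numpy array whose columns follow the order of
# the transformers above, so we wrap it back into a DataFrame.
X_train_imputed = pd.DataFrame(
    preprocessor.transform(X_train), columns=columns, index=X_train.index
)
X_test_imputed = pd.DataFrame(
    preprocessor.transform(X_test), columns=columns, index=X_test.index
)
print(X_test_imputed)
```

Whether max, min, or median is the better fill for a given column depends on the data and the model, so it is worth comparing the variants with cross-validation rather than assuming one is better.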
Alternatively, you can use the Feature-engine package, whose transformers let you specify the features to impute:
from feature_engine import imputation as msi
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    # add a binary variable to indicate missing information for the 2 variables below
    ('continuous_var_imputer', msi.AddMissingIndicator(
        variables=['LotFrontage', 'GarageYrBlt'])),

    # replace NA by the median in the 3 variables below, they are numerical
    ('continuous_var_median_imputer', msi.MeanMedianImputer(
        imputation_method='median',
        variables=['LotFrontage', 'GarageYrBlt', 'MasVnrArea'])),

    # replace NA by adding the label "Missing" in categorical variables
    # (transformer will skip those variables where there is no NA)
    ('categorical_imputer', msi.CategoricalImputer(
        variables=['var1', 'var2'])),

    # additional median imputation step for further numerical variables
    ('additional_median_imputer', msi.MeanMedianImputer(
        imputation_method='median',
        variables=['var4', 'var5'])),
])

pipe.fit(X_train)
X_train_t = pipe.transform(X_train)
Feature-engine returns DataFrames. More information is available in the Feature-engine documentation.
To install Feature-engine:
pip install feature-engine
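If the goal is only to fill each column with its own constant and keep a DataFrame, a dependency-free sketch with plain pandas may be enough. This is not part of either library shown in this answer. It assumes the fill values are learned from the training set and stored in a dict keyed by column name, and the columns `a` to `d` are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy data; the column names are hypothetical.
X_train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0],
    "b": [10.0, 20.0, np.nan, 40.0],
    "c": [2.0, np.nan, 4.0, 9.0],
    "d": ["x", None, "y", "x"],
})
X_test = pd.DataFrame({
    "a": [np.nan, 2.0],
    "b": [np.nan, 15.0],
    "c": [np.nan, 1.0],
    "d": [None, "x"],
})

# One fill value per column, learned from the training data only.
fill_values = {
    "a": X_train["a"].max(),     # maximum of the column
    "b": X_train["b"].min(),     # minimum of the column
    "c": X_train["c"].median(),  # median of the column
    "d": "Missing",              # arbitrary label for a categorical column
}

# DataFrame.fillna accepts a dict mapping column name -> fill value.
X_train_filled = X_train.fillna(value=fill_values)
X_test_filled = X_test.fillna(value=fill_values)
print(X_test_filled)
```

The trade-off is that this does not plug into a scikit-learn Pipeline, so the `fill_values` dict has to be stored and reapplied by hand at prediction time.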
Hope that helps.