
Scikit-learn Imputer with multiple values

Is there a way for a Scikit-learn Imputer to look for and replace multiple values which are considered "missing values"?

For example, I would like to do something like:

imp = Imputer(missing_values=(7,8,9))

But according to the docs, the missing_values parameter only accepts a single integer:

missing_values: integer or “NaN”, optional (default=“NaN”)

The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value “NaN”.

Why not do this manually in your original dataset? Assuming you are using a pd.DataFrame, you can do the following:

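import numpy as np
import pandas as pd
# Note: Imputer was removed from sklearn.preprocessing in scikit-learn 0.22,
# so this import only works on older releases.
from sklearn.preprocessing import Imputer

df = pd.DataFrame({'A': [1, 2, 3, 8], 'B': [1, 2, 5, 3]})
# Treat 1 and 2 as "missing" by turning them into NaN first.
df_new = df.replace([1, 2], np.nan)
# The default Imputer then fills each NaN with its column mean.
df_imp = Imputer().fit_transform(df_new)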

This results in df_imp:

array([[ 5.5,  4. ],
       [ 5.5,  4. ],
       [ 3. ,  5. ],
       [ 8. ,  3. ]])
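On scikit-learn 0.22 and later, Imputer is no longer available in sklearn.preprocessing. A rough equivalent of the same replace-then-impute idea, assuming SimpleImputer from sklearn.impute with its mean strategy is an acceptable stand-in, might look like this (the `sentinels` name is only illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'A': [1, 2, 3, 8], 'B': [1, 2, 5, 3]})

# Values to be treated as missing (illustrative; same ones as in the answer).
sentinels = [1, 2]

# Step 1: map every sentinel to NaN so the imputer sees a single marker.
df_new = df.replace(sentinels, np.nan)

# Step 2: fill each NaN with the mean of the remaining values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df_imp = imputer.fit_transform(df_new)
```

The column means are computed after the sentinels have been turned into NaN, so the sentinel values themselves do not influence the fill value.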

If you want to make this part of a pipeline, you would just need to implement a custom transformer with similar logic.
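A minimal sketch of such a transformer follows. The class name SentinelToNaN and its `sentinels` parameter are hypothetical, not part of scikit-learn; the sketch assumes numeric input and a recent scikit-learn where SimpleImputer lives in sklearn.impute.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class SentinelToNaN(BaseEstimator, TransformerMixin):
    """Hypothetical helper: replace each listed sentinel value with np.nan."""

    def __init__(self, sentinels=(7, 8, 9)):
        self.sentinels = sentinels

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X):
        # Copy as float so that NaN can be stored in the array.
        X = np.array(X, dtype=float)
        X[np.isin(X, self.sentinels)] = np.nan
        return X


X = np.array([[1, 7], [8, 2], [3, 9], [5, 4]])

# First mark 7, 8 and 9 as NaN, then impute the NaNs with the column mean.
pipeline = make_pipeline(
    SentinelToNaN(sentinels=(7, 8, 9)),
    SimpleImputer(strategy='mean'),
)
X_imp = pipeline.fit_transform(X)
```

Here both columns have a mean of 3 once the sentinels are excluded, so every 7, 8 and 9 in X is replaced by 3.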

You could chain multiple imputers in a pipeline, but that might become unwieldy pretty quickly, and I'm not sure how efficient it is.

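from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# One imputer per sentinel value; each replaces its value with the constant 10.
pipeline = make_pipeline(
    SimpleImputer(missing_values=7, strategy='constant', fill_value=10),
    SimpleImputer(missing_values=8, strategy='constant', fill_value=10),
    SimpleImputer(missing_values=9, strategy='constant', fill_value=10)
)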
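As a quick illustration of the chained approach, the sketch below builds the same three steps in a loop and applies them to a small made-up array (the data and the names `steps`, `chained` and `X_filled` are assumptions for the example):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Made-up sample data containing the sentinel values 7, 8 and 9.
X = np.array([[1.0, 7.0], [8.0, 2.0], [9.0, 3.0]])

# One constant-fill imputer per sentinel value.
steps = [
    SimpleImputer(missing_values=v, strategy='constant', fill_value=10)
    for v in (7, 8, 9)
]
chained = make_pipeline(*steps)

# Each step replaces its own sentinel with 10 and passes the rest through.
X_filled = chained.fit_transform(X)
```

Chaining works cleanly with a constant fill value. With a statistic such as strategy='mean', each step would compute its mean over data that still contains the other sentinel values, so replacing all sentinels with NaN first is likely the safer route in that case.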
