
sklearn impute rows satisfying condition

I'm trying to use sklearn SimpleImputer to impute missing ages in a particular column of a pandas DataFrame containing Titanic data. However, I want to separately impute the missing values for passengers whose names contain the word "Master", using the average of the other Masters' ages.

I tried locating that data and treating it separately:

import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")

# Copy data
imputed_X = X.copy()

# Get data for "masters"
masters = imputed_X[imputed_X['Name'].str.contains("Master")]

# Get imputed version of Age column
masters_age_imputed = pd.DataFrame(imputer.fit_transform(masters[["Age"]]))
masters_age_imputed.index = masters.index
# (So far so good... the missing values have been replaced with the average)

# But putting those values back into the DataFrame doesn't work:
imputed_X.loc[X['Name'].str.contains("Master"),"Age"] = masters_age_imputed

Instead of filling in the missing Masters' ages with the average age, this deletes all of the non-missing ages and replaces them with NaN.

Is there a better way of doing this? E.g., one that works? Aside from setting up my own for loop and replacing everything manually?

You need to fit the imputer first and then use it to transform the data. Fitting uses the column as it is, missing values included. The fitted model then fills in the missing values via transform, as I do below.

Can you try this?

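import numpy as np
from sklearn.impute import SimpleImputer
# SimpleImputer replaces the older Imputer class and always imputes column-wise (no axis argument)
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(imputed_X[['Age']])
imputed_X['Age'] = imp.transform(imputed_X[['Age']]).ravel()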

If you only want to impute a subset of the data (for example, the rows whose Name contains "Master"), you can impute that subset and merge it back into the original DataFrame. You do not need a loop; you can pd.merge it back.
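As a rough illustration of the subset idea, here is a minimal sketch. It assigns the imputed values back with a boolean mask and .loc rather than pd.merge, which is a swapped-in technique, not the approach named above. The toy DataFrame and the is_master name are assumptions made up for the example. The assignment in the question most likely fails because the imputed DataFrame has a column labelled 0, which pandas tries to align with "Age"; flattening to a plain NumPy array with ravel() sidesteps that alignment.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the Titanic DataFrame
X = pd.DataFrame({
    "Name": ["Master. A", "Mr. B", "Master. C", "Miss. D", "Master. E", "Mr. F"],
    "Age": [4.0, 30.0, np.nan, np.nan, 8.0, 40.0],
})

imputed_X = X.copy()

# Boolean mask selecting the "Master" rows (assumes Name has no missing values)
is_master = imputed_X["Name"].str.contains("Master")

imputer = SimpleImputer(strategy="mean")

# fit_transform returns a 2-D NumPy array; ravel() flattens it so the
# assignment is positional and no index/column alignment is attempted
imputed_X.loc[is_master, "Age"] = imputer.fit_transform(
    imputed_X.loc[is_master, ["Age"]]
).ravel()

# Impute everyone else from the mean of the non-Master ages
imputed_X.loc[~is_master, "Age"] = imputer.fit_transform(
    imputed_X.loc[~is_master, ["Age"]]
).ravel()

print(imputed_X)
```

If the Name column could contain missing values, passing na=False to str.contains keeps the mask strictly boolean.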


 