How does the sklearn Imputer behave if all the values in a column of the input are missing?

I have a dataset with a large number of columns. I have programmed my application so that if any value in a given column is missing, it is filled in by an imputer using the mean strategy.

However, I am a bit concerned about the case where every value in a column is missing. How does the imputer behave then, and what would be the right approach in such a case?

If all the data in a given column is missing, the Imputer discards that column from its output.

Here is an example with 5 samples and 2 columns, where one sample has a missing value:

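import numpy as np
from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22; see sklearn.impute.SimpleImputer

X = np.array([[1,1],[1,2],[1,1],[1,2],[1,np.nan]])
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
print(imputer.fit_transform(X))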

This prints out:

[[ 1.   1. ]
 [ 1.   2. ]
 [ 1.   1. ]
 [ 1.   2. ]
 [ 1.   1.5]]

However, if all data in the second column is missing:

X = np.array([[1,np.nan],[1,np.nan],[1,np.nan],[1,np.nan],[1,np.nan]])
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
print(imputer.fit_transform(X))

We obtain:

[[ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]]

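This default behaviour is arguably the right approach in that case, because a column (i.e. a feature) with no observed values carries no information and cannot be used anyway. Keep in mind that the output then has fewer columns than the input.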
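To make that behaviour concrete, here is a minimal pure-Python sketch of mean imputation that drops all-missing columns. It is only an illustration of the idea, not scikit-learn's implementation, and the function name mean_impute_drop_empty is hypothetical. It also returns the indices of the columns that were kept, which the examples above do not show.

```python
import math

def mean_impute_drop_empty(rows):
    """Mean-impute NaNs column by column; drop columns with no observed values.

    Hypothetical helper for illustration only (not part of scikit-learn).
    Returns (imputed_rows, kept_column_indices).
    """
    n_cols = len(rows[0])
    means = {}
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        if observed:  # a column with no observed values has no mean
            means[j] = sum(observed) / len(observed)
    kept = sorted(means)
    # Build the output from the kept columns only, filling NaNs with the mean.
    imputed = [
        [means[j] if math.isnan(row[j]) else row[j] for j in kept]
        for row in rows
    ]
    return imputed, kept

nan = math.nan
X = [[1, nan], [1, nan], [1, nan]]
imputed, kept = mean_impute_drop_empty(X)
print(imputed)  # [[1], [1], [1]]
print(kept)     # [0]
```

Tracking the kept indices is one way to map the remaining output columns back to their original feature names after a column has been discarded.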
