How will the Imputers work if all the values in a column are missing in the input vector in sklearn

Question

I have a dataset with a large number of columns. My application is written so that any missing value in a column is filled in by an imputer that uses the mean as its strategy.

However, I am a bit concerned about the case where every value in a column is missing. How would the imputer behave then, and what would be the right approach?

Answer 1

If in a given column, all data is missing, then the Imputer will discard that column.

Here is an example with 5 samples and 2 columns, where one sample has a missing value:

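import numpy as np
from sklearn.preprocessing import Imputer

# Replace each NaN with the mean of its column (axis=0)
X = np.array([[1,1],[1,2],[1,1],[1,2],[1,np.nan]])
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
print(imputer.fit_transform(X))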

This prints out

[[ 1.   1. ]
 [ 1.   2. ]
 [ 1.   1. ]
 [ 1.   2. ]
 [ 1.   1.5]]

However, if all data in the second column is missing:

X = np.array([[1,np.nan],[1,np.nan],[1,np.nan],[1,np.nan],[1,np.nan]])
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
print(imputer.fit_transform(X))

We obtain:

[[ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]]
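
The Imputer class used above was deprecated in scikit-learn 0.20 and removed in 0.22, where sklearn.impute.SimpleImputer takes its place. The following is a minimal sketch of the same all-missing case, assuming a scikit-learn release that ships SimpleImputer; the variable name X_imputed is only illustrative. With default settings the empty column is dropped here as well.

```python
import warnings

import numpy as np
from sklearn.impute import SimpleImputer

# Second column has no observed values at all
X = np.array([[1, np.nan]] * 5, dtype=float)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
with warnings.catch_warnings():
    # Newer releases may warn that a feature without observed values is skipped
    warnings.simplefilter("ignore")
    X_imputed = imputer.fit_transform(X)

print(X_imputed.shape)  # only the first column is left
```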

This default behaviour could be the right approach in that case, because this column (i.e. this feature) cannot be used anyway.
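
Because a dropped column shifts the positions of the remaining features, an application may want to know which columns survived. The sketch below is a plain-Python illustration of the behaviour described in this answer, not scikit-learn's actual implementation; the function name impute_column_means and its return format are assumptions made for this example.

```python
import math


def impute_column_means(rows):
    """Mean-impute NaNs per column; drop columns with no observed values.

    Returns (imputed_rows, kept_columns), where kept_columns holds the
    original indices of the columns that were kept.
    """
    n_cols = len(rows[0])
    means = {}
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        if observed:  # a mean exists only if something was observed
            means[j] = sum(observed) / len(observed)
    kept_columns = sorted(means)
    imputed = [
        [means[j] if math.isnan(row[j]) else row[j] for j in kept_columns]
        for row in rows
    ]
    return imputed, kept_columns


nan = float("nan")
partly_missing = [[1, 1], [1, 2], [1, 1], [1, 2], [1, nan]]
all_missing = [[1, nan]] * 5

filled, kept = impute_column_means(partly_missing)
print(filled[-1], kept)     # [1, 1.5] [0, 1]

filled_b, kept_b = impute_column_means(all_missing)
print(filled_b[0], kept_b)  # [1] [0]
```

Comparing kept_columns with the full list of column indices shows which features were discarded, so downstream code can stay aligned with the reduced matrix.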

solution1
1 2016-12-26 11:32:38
