I have a dataset with a large number of columns. I have written my application so that any missing value in a column is filled in by an imputer, using the mean as the imputation strategy.
However, I am a bit concerned about what happens when every value in a column is missing: how does the imputer behave, and what would be the right approach in such a case?
If all the data in a given column is missing, the Imputer discards that column.
Here is an example with 5 samples and 2 columns, where one sample has a missing value:
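import numpy as np
from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22; sklearn.impute.SimpleImputer replaces it

X = np.array([[1, 1], [1, 2], [1, 1], [1, 2], [1, np.nan]])
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
print(imputer.fit_transform(X))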
This prints out
[[ 1. 1. ]
[ 1. 2. ]
[ 1. 1. ]
[ 1. 2. ]
[ 1. 1.5]]
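The 1.5 in the last row is simply the mean of the observed values in the second column. A minimal check of that arithmetic with plain NumPy, independent of scikit-learn (the names `col_means` and `X_filled` are only illustrative):

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 1], [1, 2], [1, np.nan]])

# Column means computed over the observed (non-NaN) entries only.
col_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its own column.
X_filled = np.where(np.isnan(X), col_means, X)
print(X_filled[-1])
```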
However, if all data in the second column is missing:
X = np.array([[1,np.nan],[1,np.nan],[1,np.nan],[1,np.nan],[1,np.nan]])
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
print(imputer.fit_transform(X))
We obtain:
[[ 1.]
[ 1.]
[ 1.]
[ 1.]
[ 1.]]
This default behaviour is arguably the right approach in that case, because this column (i.e. this feature) contains no observed values and cannot be used anyway.
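If you would rather know which columns are going to disappear than have them dropped silently, one option is to detect the all-missing columns yourself before imputing. The following is only a sketch under some assumptions: it uses `SimpleImputer` from `sklearn.impute` (available in scikit-learn 0.20 and later) instead of the older `Imputer` shown above, and the variable names `all_missing`, `kept_columns` and `dropped_columns` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # newer replacement for Imputer

X = np.array([[1, np.nan], [1, np.nan], [1, np.nan], [1, np.nan], [1, np.nan]])

# Boolean mask marking the columns in which every entry is NaN.
all_missing = np.isnan(X).all(axis=0)
kept_columns = np.flatnonzero(~all_missing)
dropped_columns = np.flatnonzero(all_missing)

# Impute only the columns that have at least one observed value.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X[:, kept_columns])

print("dropped columns:", dropped_columns.tolist())
print("result shape:", X_imputed.shape)
```

Keeping `kept_columns` around lets you map the imputed array back to the original column positions, and `dropped_columns` can be logged so that an entirely empty feature does not go unnoticed.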