I am reading Python for Data Science for Dummies (2nd ed.), chapter 6, the "Imputing Missing Data" section. The book shows sample code using the scikit-learn library:
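import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

s = pd.Series([1, 2, 3, np.nan, 5, 6, None])
# SimpleImputer takes no axis argument; it always imputes column by column.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit() expects a 2D array of shape (n_samples, n_features).
imp.fit(np.array([1, 2, 3, 4, 5, 6, 7]).reshape(-1, 1))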
I tried to understand the code line by line, so I read the fit(X) documentation on sklearn's website. However, it just says:
"Fit the imputer on X."
I don't understand this. Then, upon further reading, I found this:
"Before you can impute anything, you must provide statistics for the Imputer to use by calling fit()."
I don't understand that either.
So my question is: what does the word 'statistics' mean here? Thanks.
SimpleImputer is used to fill NaN values according to the strategy parameter: the mean or the median of the feature, the most_frequent value, or a constant.
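The fit() function calculates the statistic that your strategy calls for. This is what the word 'statistics' refers to. For example, if strategy='mean', fit() calculates the mean of each column of the X dataset, ignoring missing values, and stores the result in the imputer's statistics_ attribute.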
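To make this concrete, here is a minimal sketch that fits an imputer with three different strategies and prints what each one learned. The dataset X is made up for illustration and is not from the book; statistics_ is the fitted attribute of SimpleImputer.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical one-column dataset with a single missing entry.
X = np.array([[1.0], [2.0], [2.0], [np.nan], [7.0]])

learned = {}
for strategy in ("mean", "median", "most_frequent"):
    imp = SimpleImputer(strategy=strategy)
    imp.fit(X)  # computes the per-column statistic, skipping NaN
    learned[strategy] = float(imp.statistics_[0])
    print(strategy, learned[strategy])
```

Nothing is filled in at this point. fit() only computes and stores the value that will later replace the missing entries.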
Once this is done, the imputer can be used to fill the missing values in a dataset as follows:
from sklearn.impute import SimpleImputer
import numpy as np
X_train = np.array([0,0, np.nan, 1, 1]).reshape((-1,1))
SimpleImputer(strategy='mean').fit_transform(X_train)
Output:
array([[0. ],
       [0. ],
       [0.5],
       [1. ],
       [1. ]])
Note that fit_transform() performs both the fit() and transform() operations in a single call.
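Keeping the two steps separate is useful when the statistic should come from one dataset and be applied to another. The sketch below assumes a made-up X_new array standing in for unseen data; the fill value comes from X_train only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([0, 0, np.nan, 1, 1]).reshape(-1, 1)
X_new = np.array([np.nan, 10.0]).reshape(-1, 1)  # hypothetical unseen data

imp = SimpleImputer(strategy="mean")
imp.fit(X_train)                     # learns the mean of X_train (0.5)
X_new_filled = imp.transform(X_new)  # replaces NaN with the stored mean
print(X_new_filled)
```

Here the NaN in X_new becomes 0.5, the mean learned from X_train, rather than the mean of X_new itself.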