
sklearn.impute fit() function

I am reading Python for Data Science for Dummies (2nd ed.), Chapter 6, in the "Imputing Missing Data" section. The book shows sample code that uses the scikit-learn library.

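import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

s = pd.Series([1, 2, 3, np.nan, 5, 6, None])
# SimpleImputer takes no axis argument; missing values are marked with np.nan
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit() expects a 2D array of shape (n_samples, n_features)
imp.fit(np.array([1, 2, 3, 4, 5, 6, 7]).reshape(-1, 1))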

I tried to understand the code line by line, so I read the fit(X) documentation on sklearn's website. However, it just says:

fit the imputer on X

I don't understand this. Then, upon further reading, I found this:

Before you can impute anything, you must provide statistics for the Imputer to use by calling fit()

which I don't understand either.

So my question is: what does the word 'statistics' mean here? Thanks.

SimpleImputer fills NaN values according to the strategy parameter: the mean or median of the feature, its most_frequent value, or a constant.

The fit() function calculates the statistic that your strategy calls for.

For example, if strategy='mean', fit() calculates the mean of each column of X, ignoring the missing values.
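As a minimal sketch of what "statistics" means here: after fit(), the imputer keeps the computed value for each column in its statistics_ attribute. The array X below is made-up illustration data, not taken from the book.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data with one missing value
X = np.array([1, 2, 3, np.nan, 5, 6]).reshape(-1, 1)

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X)  # learns the column mean, skipping the NaN

# One learned value per column: the mean of 1, 2, 3, 5, 6
print(imp.statistics_)  # [3.4]
```

So the "statistic" is simply the number (here 3.4) that will later be written into every missing slot of that column.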

Once this is done, the imputer can be used to fill the missing values in a dataset as follows.

from sklearn.impute import SimpleImputer
import numpy as np
X_train = np.array([0,0, np.nan, 1, 1]).reshape((-1,1))

SimpleImputer(strategy='mean').fit_transform(X_train)

Output:

array([[0. ],
       [0. ],
       [0.5],
       [1. ],
       [1. ]])

Note that fit_transform() performs both the fit() and transform() operations in a single call.
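The two steps can also be called separately, which is useful when the statistic should be learned from one dataset and applied to another. A small sketch, where X_new is a hypothetical second dataset introduced only for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([0, 0, np.nan, 1, 1]).reshape(-1, 1)
# Hypothetical new data, not part of the original example
X_new = np.array([np.nan, 4, np.nan]).reshape(-1, 1)

imp = SimpleImputer(strategy='mean')
imp.fit(X_train)                       # learns the mean of X_train (0.5)

filled_train = imp.transform(X_train)  # same result as fit_transform(X_train)
filled_new = imp.transform(X_new)      # NaNs replaced by 0.5, not by the mean of X_new

print(filled_new.ravel())
```

The missing values in X_new are filled with the mean learned from X_train, because transform() reuses the stored statistic and does not recompute it.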


 