
Soft-impute on the test set with fancyimpute

The Python package fancyimpute provides several data imputation methods. I have tried to use the soft-impute approach; however, SoftImpute doesn't offer a transform method that can be applied to the test dataset. More precisely, scikit-learn's SimpleImputer (see the example below) provides fit, transform and fit_transform methods. SoftImpute, on the other hand, provides only fit_transform, which lets me fit on the training data but not apply that fit to the test set. I understand that fitting the imputer on the training and test sets together would leak information from the test set into training, so we need to fit on the training set and transform the test set. Is there any way to impute the test set using what was fitted on the training set with the soft-impute approach? I appreciate any thoughts.

    # this example from https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
    import numpy as np 
    from sklearn.impute import SimpleImputer
    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

    X_train = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
    print(imp_mean.transform(X_train))
    # SimpleImputer provides a transform method, so the fitted imputer can be
    # applied to the test data, e.g.:
    # X_test = [...]
    # print(imp_mean.transform(X_test))

    from fancyimpute import SoftImpute
    clf = SoftImpute(verbose=True)
    clf.fit_transform(X_train)
    # There is no clf.transform to use on the test set, e.g. clf.transform(X_test)


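fancyimpute doesn't support inductive mode, i.e. fitting on one dataset and then applying the fitted imputer to another. The important thing here is to impute the training data without using the test data. I think you can then impute the test data with the help of the already-imputed training data: stack the test rows under the imputed training rows, run fit_transform again, and keep only the test rows. Sample code: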

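    import pandas as pd
    from fancyimpute import SoftImpute

    # assumes train_df and test_df are pandas DataFrames with the same columns
    len_train_data = train_df.shape[0]
    imputer = SoftImpute()

    # impute the training data on its own
    X_train_fill_SVD = imputer.fit_transform(train_df)
    # keep the original column names so that concat aligns with test_df
    X_train_fill_SVD = pd.DataFrame(X_train_fill_SVD, columns=train_df.columns)

    # concatenate the imputed training data with the raw test data
    Concat_data = pd.concat((X_train_fill_SVD, test_df), axis=0)
    Concat_data = imputer.fit_transform(Concat_data)
    Concat_data = pd.DataFrame(Concat_data, columns=train_df.columns)

    # fetch the imputed test rows
    X_test_fill_SVD = Concat_data.iloc[len_train_data:]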
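
The same two-pass idea can be wrapped in a small helper that works with any imputer exposing only fit_transform. This is a minimal sketch, not part of fancyimpute: the names `impute_train_then_test` and `column_mean_fit_transform` are hypothetical, and the column-mean imputer is only a stand-in so that the example runs with NumPy alone.

```python
import numpy as np


def column_mean_fit_transform(X):
    """Stand-in imputer (hypothetical): fill each NaN with its column mean."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


def impute_train_then_test(fit_transform, X_train, X_test):
    """Impute train alone, then impute test stacked under the imputed train."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)

    # Step 1: the training imputation never sees the test rows.
    train_filled = np.asarray(fit_transform(X_train))

    # Step 2: stack the imputed training rows on top of the raw test rows
    # and run the imputer again on the combined matrix.
    stacked = np.vstack([train_filled, X_test])
    stacked_filled = np.asarray(fit_transform(stacked))

    # Step 3: keep only the rows that belong to the test set.
    test_filled = stacked_filled[len(X_train):]
    return train_filled, test_filled


X_train = np.array([[np.nan, 2.0, 3.0], [4.0, np.nan, 6.0], [10.0, 5.0, 9.0]])
X_test = np.array([[1.0, np.nan, 3.0], [np.nan, 8.0, 9.0]])

train_filled, test_filled = impute_train_then_test(
    column_mean_fit_transform, X_train, X_test
)
print(train_filled)
print(test_filled)
```

To try it with soft-impute, you would pass `SoftImpute(verbose=False).fit_transform` in place of the stand-in. Note that the second pass still refits on the stacked matrix, so the observed test values influence the test imputation. It is not a true transform, but the imputed training data does not depend on the test set.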


 