Impute missing values for the test set

Question

I am using the Adult data set. I imputed the missing values in the training data, and I want to apply the same fill values computed from the training data to the test data. I must be missing something, because I cannot get it right. My code is as follows:

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = ['age','workclass','fnlwgt','education','educationNum','maritalStatus','occupation','relationship','race','sex','capitalGain','capitalLoss','hoursPerWeek','nativeCountry']

x_train = train[list(features)]
y_train = train['class']
x_test = test[list(features)]
y_test = test['class']

class DataFrameImputer(TransformerMixin):
    def __init__(self):
        """Impute missing values.
        Columns of dtype object are imputed with the most frequent value in column.
        columns of other types are imputed with mean of column"""
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
                               if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
                              index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)


# 2 step transformation, fit and transform
# -------Impute missing values-------------

x_train = pd.DataFrame(x_train)  # x_train is class
x_test = pd.DataFrame(x_test)
x_train_new = DataFrameImputer().fit_transform(x_train)
x_train_new = pd.DataFrame(x_train_new)
# use the same values fitted on the training data to fill the test data

for c in x_test:
    if x_test[c].dtype==np.dtype('O'):
        x_test.fillna(x_train[c].value_counts().index[0])
    else:
        x_test.fillna(x_train[c].mean(),inplace=True)

Answer 1

We want to take what we learn from the training data and apply it to the test data. In the code above, the loop does not work: the first column is numeric, so calling fillna with a single scalar and inplace=True fills every NaN in the test data, across all columns, with the mean of the first training column. (The fillna call in the object branch has no effect at all, because its result is discarded.) Instead, if I pass fillna a dictionary, called values here, each column of the test data is filled with the value computed for the column of the same name in the training data.

values = {}  # column name -> fill value learned from the training data
for c in x_train:
    if x_train[c].dtype == np.dtype('O'):
        # object (string) column: most frequent value
        values[c] = x_train[c].value_counts().index[0]
    else:
        # numeric column: mean
        values[c] = x_train[c].mean()

x_test_new = x_test.fillna(value=values)
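
The same idea can also be written as a fit-once, transform-many helper, which is roughly what the DataFrameImputer in the question was aiming for. The following is only a minimal sketch under some assumptions: FrameImputer and the two toy frames are hypothetical names and data made up for illustration, and the column type is checked with pd.api.types.is_numeric_dtype rather than by comparing against the object dtype.

```python
import pandas as pd


class FrameImputer:
    """Hypothetical helper: learn fill values on one frame, apply them to another."""

    def fit(self, X):
        # Mean for numeric columns, most frequent value for everything else.
        self.fill_ = {
            c: X[c].mean()
            if pd.api.types.is_numeric_dtype(X[c])
            else X[c].value_counts().index[0]
            for c in X.columns
        }
        return self

    def transform(self, X):
        # A dict passed to fillna is matched by column name,
        # so each column gets its own fill value.
        return X.fillna(value=self.fill_)


# Toy data standing in for the real train/test frames (illustrative only).
train = pd.DataFrame({
    "age": [20.0, 40.0, None],
    "workclass": ["Private", "Private", "State-gov"],
})
test = pd.DataFrame({
    "age": [None, 50.0],
    "workclass": [None, "State-gov"],
})

imputer = FrameImputer().fit(train)      # statistics come from train only
train_new = imputer.transform(train)
test_new = imputer.transform(test)       # test reuses the train statistics
print(test_new)
```

Fitting once and calling transform on both frames keeps the test data from influencing the fill values, which is the point of the dictionary approach above.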
