
Scikit-learn imputing values incorrectly

I am using Scikit-learn to impute missing values for my data set, but looking at the largest values for one of my features, it is clear that the missing values are being imputed incorrectly. First I use a pandas function to see the 10 largest values for a feature in my data set:

 ofData = mergeData.iloc[:, 3]
 print(ofData.nlargest(10))

The output of this is:

 124    4.0
 128    4.0
 146    4.0
 147    4.0
 177    4.0
 240    4.0
 253    4.0
 310    4.0
 360    4.0
 361    4.0

This is correct: I know 4.0 to be the maximum possible value for this feature. Then I impute the data with Scikit-learn:

 imp = Imputer(missing_values='NaN', strategy='mean', axis=1)
 nData = imp.fit_transform(mergeData)
 nData = pd.DataFrame(nData)

Then I once again use pandas to see the 10 largest values for this feature:

 ofData = nData.iloc[:, 3]
 print(ofData.nlargest(10))

Which outputs:

 1030    77.571129
 1056    67.804684
 1308    62.780544
 1212    61.902375
 927     61.207525
 870     60.592999
 1100    55.604145
 1722    55.308159
 1415    52.637559
 72      49.940297

These values are clearly not the mean of that feature, since they are all larger than the maximum value from before imputation. I'm completely lost on what could be causing this, and I'm worried it could be affecting the imputation of other features in my data set as well.

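Since you want to replace missing values with the mean of the feature (the column), axis must be 0 (which is the default), not 1. With axis=1, your code replaces each missing value with the mean of its row.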
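To make the difference concrete, here is a minimal sketch in plain NumPy rather than the scikit-learn class itself. The toy array `X` and the helper `impute_mean` are hypothetical, made up for illustration; the helper only mimics the axis convention described above (axis=0 for column means, axis=1 for row means), assuming the other features in a row can sit on a much larger scale than the 1 to 4 feature.

```python
import numpy as np

# Hypothetical toy data: column 0 is on a 1-4 scale,
# column 1 is a different feature on a much larger scale.
X = np.array([
    [4.0, 100.0],
    [np.nan, 200.0],
    [2.0, 300.0],
])

def impute_mean(X, axis):
    """Fill NaNs with the mean taken along `axis`.

    axis=0 uses the mean of each column (feature);
    axis=1 uses the mean of each row (sample).
    """
    X = X.copy()
    # keepdims=True keeps the means broadcastable against X.
    means = np.nanmean(X, axis=axis, keepdims=True)
    mask = np.isnan(X)
    X[mask] = np.broadcast_to(means, X.shape)[mask]
    return X

by_column = impute_mean(X, axis=0)
by_row = impute_mean(X, axis=1)

print(by_column[1, 0])  # 3.0, the mean of 4.0 and 2.0 in the same column
print(by_row[1, 0])     # 200.0, the mean of the other values in the same row
```

The row-wise result exceeds the feature's original maximum of 4.0, which matches the symptom in the question. Note that in newer scikit-learn releases `Imputer` has been replaced by `sklearn.impute.SimpleImputer`, which has no `axis` parameter and imputes column-wise.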

