preprocessing.MinMaxScaler and preprocessing.normalize return a dataframe of nulls

Question

I have a dataframe with floats as data, and I'd like to normalize it. I first convert it to int, because otherwise I get the error ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). My code for normalizing:

def normalize_df():
    x = my_df.values.astype(int)
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    df = pd.DataFrame(x_scaled)
    return df

And my output is:

    0   1   2   3   4   5   6   7   8   9   ...     12  13  14  15  16  17  18  19  20  21
0   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
2   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
3   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
4   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0

What's happening (assuming that my initial dataframe contains 0 values in some rows, but in less than 30% of the dataframe)? How can I fix this bug and avoid the all-zero output?

EDITED

My data looks like this (there are many more columns and rows):

 36680            0        22498037            2266   
 0             2218        22502676               0   
 26141            0        22505885            4533   
 39009            0        22520711            4600   
 36237            0        22527171            5933

I am trying to get the values into the range 0.0 to 1.0.

Answer 1

It's not a bug. It happens because you are trying to convert NaN values into integers. Look at how that works (on my machine):

In [132]: a
Out[132]: array([ nan,   1.,  nan])

In [133]: a.astype(int)
Out[133]: array([-9223372036854775808,                    1, -9223372036854775808])

So each NaN becomes an extremely small (hugely negative) value compared to the other integers in your dataset, and this causes the incorrect scaling.
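To make the effect concrete, here is a minimal sketch using plain NumPy rather than scikit-learn. It assumes the NaN was cast to the integer shown in the session above (the result of that cast is not guaranteed and may differ by platform and NumPy version), and it takes the first column of the sample data with one hypothetical missing entry. The helper min_max is an illustration, not the MinMaxScaler implementation.

```python
import numpy as np

# Stand-in for what NaN became after astype(int) in the session above.
# Assumption: the real result of that cast depends on platform and NumPy version.
NAN_AS_INT = np.iinfo(np.int64).min  # -9223372036854775808


def min_max(column):
    """Plain min-max scaling to [0, 1], ignoring NaN when finding min and max."""
    column = np.asarray(column, dtype=np.float64)
    low, high = np.nanmin(column), np.nanmax(column)
    return (column - low) / (high - low)


# First column of the sample data, with one hypothetical missing entry.
as_int = np.array([36680, NAN_AS_INT, 26141, 39009, 36237], dtype=np.int64)
as_float = np.array([36680, np.nan, 26141, 39009, 36237], dtype=np.float64)

broken = min_max(as_int)    # the huge negative sentinel defines the range
kept = min_max(as_float)    # NaN stays NaN, real values spread over [0, 1]
print(broken)
print(kept)
```

In the first result the sentinel maps to 0 and every genuine value is squeezed to within rounding error of a single number, so the differences between them are lost. In the second, the missing entry is still NaN and has to be imputed or dropped, but the real values keep their relative spacing.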

To fix this problem you should work with floats. Before scaling, you need to get rid of the NaNs, either with some imputation or by removing the incomplete samples altogether. Look at sklearn.preprocessing.Imputer.
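A sketch of how the function from the question could be rewritten along these lines. Several things here are assumptions: it uses sklearn.impute.SimpleImputer, which newer scikit-learn releases provide in place of the Imputer class named above; the median strategy is an arbitrary choice; the column names and the injected NaN and inf values are hypothetical; and every column is assumed to contain at least one finite value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler


def normalize_df(df: pd.DataFrame, strategy: str = "median") -> pd.DataFrame:
    """Scale each column to [0, 1] while keeping the data as floats."""
    # Treat +/-inf as missing so the imputer handles them like NaN.
    cleaned = df.astype(float).replace([np.inf, -np.inf], np.nan)
    # Fill missing entries per column. Assumes no column is entirely missing.
    # Alternative: cleaned.dropna() to discard incomplete rows instead.
    imputed = SimpleImputer(strategy=strategy).fit_transform(cleaned)
    scaled = MinMaxScaler().fit_transform(imputed)
    return pd.DataFrame(scaled, index=df.index, columns=df.columns)


# Sample values from the question, with one NaN and one inf added for illustration.
my_df = pd.DataFrame({
    "a": [36680, 0, 26141, 39009, 36237],
    "b": [0, 2218, np.nan, 0, 0],
    "c": [22498037, 22502676, 22505885, np.inf, 22527171],
    "d": [2266, 0, 4533, 4600, 5933],
})

result = normalize_df(my_df)
print(result)
```

Whether to impute or to drop rows depends on how much data is missing and on what the downstream model tolerates; the point is only that the cast to int is removed and the missing values are handled explicitly before scaling.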

Answer 1: accepted, 2015-10-30 10:27:29
