
preprocessing.MinMaxScaler and preprocessing.normalize return dataframe of Nulls

I have a dataframe with floats as data, and I'd like to normalize it. I first convert the values to int, because otherwise I get the error ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). My code for normalizing:

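import pandas as pd
from sklearn import preprocessing

def normalize_df():
    # my_df is the dataframe of floats, defined elsewhere
    x = my_df.values.astype(int)
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    df = pd.DataFrame(x_scaled)
    return df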

And my output is:

    0   1   2   3   4   5   6   7   8   9   ...     12  13  14  15  16  17  18  19  20  21
0   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
2   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
3   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
4   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0

What's happening (my initial dataframe contains the value 0 in some rows, but in less than 30% of the dataframe)? How can I fix this bug and avoid the all-zero output?

EDITED

My data looks like this (there are many more columns and rows):

 36680            0        22498037            2266   
 0             2218        22502676               0   
 26141            0        22505885            4533   
 39009            0        22520711            4600   
 36237            0        22527171            5933   

I am trying to get the values into the range 0.0 to 1.0.

It's not a bug. It happens because you are trying to convert NaN values into integers. Look at how that works (on my machine):

In [132]: a
Out[132]: array([ nan,   1.,  nan])

In [133]: a.astype(int)
Out[133]: array([-9223372036854775808,                    1, -9223372036854775808])

So each NaN becomes a huge negative integer, far below every other value in your dataset, and this causes the incorrect scaling.
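A minimal sketch of that effect, under some assumptions: the sentinel is written out explicitly as the smallest int64 (the value shown in the output above; what a NaN-to-int cast produces can differ by platform), and the two real values are borrowed from the question's sample data. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single column: the first entry stands in for a NaN that was
# cast to int64; the other two values come from the question's sample data.
sentinel = np.iinfo(np.int64).min
x = np.array([[sentinel], [36680], [26141]], dtype=np.int64)

scaled = MinMaxScaler().fit_transform(x)
print(scaled.ravel())

# The column's range is now about 9.2e18, so the difference between the
# two real values is no longer visible after scaling.
spread = abs(scaled[1, 0] - scaled[2, 0])
print(spread)
```

A single such entry in a column is enough to stretch that column's range so far that the genuine values become indistinguishable from one another.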

To fix this problem you should work with floats. Before scaling you need to get rid of the NaNs with some imputation, or remove such incomplete samples altogether. Look at sklearn.preprocessing.Imputer (replaced by sklearn.impute.SimpleImputer in newer scikit-learn releases).
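One way this could look in practice is sketched below. It assumes a recent scikit-learn where SimpleImputer is available, uses column-mean imputation purely as an example strategy, and treats infinities as missing values as well (an assumption suggested by the original error message). The function signature and the sample frame are hypothetical, not taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler


def normalize_df(df, strategy="mean"):
    """Impute missing values per column, then scale each column to [0, 1]."""
    x = df.to_numpy(dtype=float)          # stay in floating point, no int cast
    x = np.where(np.isinf(x), np.nan, x)  # assumption: treat +/-inf as missing
    x = SimpleImputer(strategy=strategy).fit_transform(x)
    x_scaled = MinMaxScaler().fit_transform(x)
    return pd.DataFrame(x_scaled, index=df.index, columns=df.columns)


# Hypothetical sample loosely based on the question's data, with NaNs added.
sample = pd.DataFrame({
    "a": [36680.0, np.nan, 26141.0, 39009.0],
    "b": [0.0, 2218.0, 0.0, np.nan],
})
scaled = normalize_df(sample)
print(scaled)
```

If dropping incomplete rows is preferable to imputing, calling df.dropna() before scaling is the simpler route. Note that a column consisting only of NaN may be dropped by the imputer, in which case the column labels above would no longer line up.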
