
preprocessing.MinMaxScaler and preprocessing.normalize return dataframe of Nulls

I have a dataframe with floats as data, and I'd like to normalize it, so first I convert it to int (otherwise I get the error ValueError: Input contains NaN, infinity or a value too large for dtype('float64').). My code for normalizing:

import pandas as pd
from sklearn import preprocessing

def normalize_df():
    x = my_df.values.astype(int)
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    df = pd.DataFrame(x_scaled)
    return df

And my output is

    0   1   2   3   4   5   6   7   8   9   ...     12  13  14  15  16  17  18  19  20  21
0   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
2   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
3   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
4   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0

What's happening (my initial dataframe does contain the value 0 in some rows, but in less than 30% of the dataframe)? How can I fix this bug and avoid the all-zero output?
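For reference, a quick way to confirm whether NaN or infinite values are what triggers that ValueError, before any casting or scaling (the small frame here is just a stand-in for the real my_df):

```python
import numpy as np
import pandas as pd

# stand-in for the real my_df, with one NaN and one inf planted
my_df = pd.DataFrame({"a": [36680.0, np.nan], "b": [2266.0, np.inf]})

print(my_df.isna().sum())                # NaN count per column
print(np.isinf(my_df.to_numpy()).any())  # any infinite values at all?
```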

EDITED

my data looks like this (there are many more columns and rows):

 36680            0        22498037            2266   
 0             2218        22502676               0   
 26141            0        22505885            4533   
 39009            0        22520711            4600   
 36237            0        22527171            5933   

And I want the values to range from 0.0 to 1.0.

It's not a bug. It happens because you are trying to convert NaN values into integers. Look how it works (on my machine):

In [132]: a
Out[132]: array([ nan,   1.,  nan])

In [133]: a.astype(int)
Out[133]: array([-9223372036854775808,                    1, -9223372036854775808])

So each NaN becomes a value that is tiny compared to the other integers in your dataset, and this is what throws the scaling off.
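The safe pattern is to test for NaN before ever casting to int. A minimal sketch (note that casting NaN to an integer is undefined behaviour in NumPy; the huge negative number shown above is simply what typical platforms happen to produce):

```python
import numpy as np

a = np.array([np.nan, 1.0, np.nan])

# check for NaN first, instead of letting the cast silently mangle it
print(np.isnan(a).any())   # True

# the cast itself is undefined for NaN; on common platforms each NaN
# comes out as INT64_MIN, exactly as in the session above
print(a.astype(np.int64))
```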

To fix this, work with floats. Before scaling you need to get rid of the NaNs with some imputation, or drop the incomplete samples entirely. Look at sklearn.preprocessing.Imputer (replaced by sklearn.impute.SimpleImputer in newer scikit-learn versions).
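A minimal sketch of that impute-then-scale pipeline, using sklearn.impute.SimpleImputer (the modern replacement for the removed preprocessing.Imputer) and keeping everything in float; the small frame stands in for the real my_df:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer      # successor to preprocessing.Imputer
from sklearn.preprocessing import MinMaxScaler

# stand-in for my_df, with a NaN deliberately left in place
my_df = pd.DataFrame({"a": [36680.0, np.nan, 26141.0],
                      "b": [2266.0, 0.0, 4533.0]})

# impute first (column means here), then scale -- no int cast anywhere
x = SimpleImputer(strategy="mean").fit_transform(my_df.values)
x_scaled = MinMaxScaler().fit_transform(x)
df = pd.DataFrame(x_scaled, columns=my_df.columns)
```

Every value in df now lies in [0.0, 1.0], and no column collapses to a constant the way it does when a stray INT64_MIN dominates the column minimum.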
