I have a dataframe with floats as data, and I'd like to normalize it. First I convert it to int (otherwise I get the error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')). My code for normalizing:
import pandas as pd
from sklearn import preprocessing

def normalize_df():
    x = my_df.values.astype(int)
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    df = pd.DataFrame(x_scaled)
    return df
And my output is
0 1 2 3 4 5 6 7 8 9 ... 12 13 14 15 16 17 18 19 20 21
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
What's happening (my initial dataframe does contain zeros in some rows, but they make up less than 30% of the dataframe)? How can I fix this bug and avoid the all-zero output?
EDITED
My data looks like this (there are many more columns and rows):
36680 0 22498037 2266
0 2218 22502676 0
26141 0 22505885 4533
39009 0 22520711 4600
36237 0 22527171 5933
And I want the values to range from 0.0 to 1.0.
It's not a bug; it happens because you are trying to convert NaN values into integers. Look how it works (on my machine):
In [132]: a
Out[132]: array([ nan, 1., nan])
In [133]: a.astype(int)
Out[133]: array([-9223372036854775808, 1, -9223372036854775808])
So each NaN becomes an extremely small value compared to the other integers in your dataset, and this causes incorrect scaling.
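A quick sketch of that distortion (the sample numbers below are just for illustration): once a NaN has been cast to the int64 minimum, that sentinel dominates the column's range, so MinMaxScaler squeezes every legitimate value toward 1.0 and they lose all contrast with each other:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A column whose NaN was silently cast to the int64 minimum
x = np.array([[-9223372036854775808.0], [2218.0], [22502676.0]])

scaled = MinMaxScaler().fit_transform(x)
# The sentinel maps to 0.0; the two real values both land at ~1.0,
# even though one is four orders of magnitude larger than the other.
print(scaled.ravel())
```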
To fix this problem you should work with floats. Before scaling, you need to get rid of the NaNs with some imputation, or remove the incomplete samples entirely. Look at sklearn.preprocessing.Imputer (replaced by sklearn.impute.SimpleImputer in newer versions of scikit-learn).
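A minimal sketch of the fix, assuming a modern scikit-learn where the imputer lives in sklearn.impute; the column names and the mean strategy are only illustrative. The point is to impute and scale while staying in floats, never casting to int:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

def normalize_df(my_df):
    # Fill each NaN with its column mean instead of casting to int
    imputed = SimpleImputer(strategy="mean").fit_transform(my_df.values)
    # Now MinMaxScaler sees only genuine values, so [0, 1] scaling works
    x_scaled = MinMaxScaler().fit_transform(imputed)
    return pd.DataFrame(x_scaled, columns=my_df.columns)

df = pd.DataFrame({"a": [36680.0, np.nan, 26141.0],
                   "b": [2266.0, 0.0, 4533.0]})
print(normalize_df(df))  # every value now lies in [0.0, 1.0]
```

Dropping incomplete rows with my_df.dropna() before scaling is the simpler alternative when you can afford to lose those samples.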