I have a dataframe with floats as data, and I'd like to normalize it. First I convert it to int (otherwise I get the error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')). My code for normalizing:
import pandas as pd
from sklearn import preprocessing

def normalize_df():
    x = my_df.values.astype(int)
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    df = pd.DataFrame(x_scaled)
    return df
And my output is
0 1 2 3 4 5 6 7 8 9 ... 12 13 14 15 16 17 18 19 20 21
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
What's happening (my initial dataframe does contain zeros in some rows, but they make up less than 30% of the dataframe)? How can I fix this bug and avoid the all-zero output?
EDITED
My data looks like this (there are many more columns and rows):
36680 0 22498037 2266
0 2218 22502676 0
26141 0 22505885 4533
39009 0 22520711 4600
36237 0 22527171 5933
And I want the values to range from 0.0 to 1.0.
It's not a bug; it happens because you are trying to convert NaN values into integers. Look how it works (on my machine):
In [132]: a
Out[132]: array([ nan, 1., nan])
In [133]: a.astype(int)
Out[133]: array([-9223372036854775808, 1, -9223372036854775808])
So each NaN becomes an extremely small value compared to the other integers in your dataset, and this causes incorrect scaling.
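A quick sketch of that distortion (the sample numbers below are just for illustration): once a NaN has been cast to the int64 minimum, that sentinel dominates the column's range, so MinMaxScaler squeezes every legitimate value toward 1.0 and they lose all contrast with each other:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A column whose NaN was silently cast to the int64 minimum
x = np.array([[-9223372036854775808.0], [2218.0], [22502676.0]])

scaled = MinMaxScaler().fit_transform(x)
# The sentinel maps to 0.0; the two real values both land at ~1.0,
# even though one is four orders of magnitude larger than the other.
print(scaled.ravel())
```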
To fix this problem you should work with floats. Before scaling, you need to get rid of the NaNs with some imputation, or remove the incomplete samples entirely. Look at sklearn.preprocessing.Imputer (replaced by sklearn.impute.SimpleImputer in newer versions of scikit-learn).
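A minimal sketch of the fix, assuming a modern scikit-learn where the imputer lives in sklearn.impute; the column names and the mean strategy are only illustrative. The point is to impute and scale while staying in floats, never casting to int:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

def normalize_df(my_df):
    # Fill each NaN with its column mean instead of casting to int
    imputed = SimpleImputer(strategy="mean").fit_transform(my_df.values)
    # Now MinMaxScaler sees only genuine values, so [0, 1] scaling works
    x_scaled = MinMaxScaler().fit_transform(imputed)
    return pd.DataFrame(x_scaled, columns=my_df.columns)

df = pd.DataFrame({"a": [36680.0, np.nan, 26141.0],
                   "b": [2266.0, 0.0, 4533.0]})
print(normalize_df(df))  # every value now lies in [0.0, 1.0]
```

Dropping incomplete rows with my_df.dropna() before scaling is the simpler alternative when you can afford to lose those samples.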