
How to solve the error "value too large for dtype('float32')?"

I have read many similar questions but still cannot figure this out.

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

X_to_predict = array([[  1.37097033e+002,   0.00000000e+000,  -1.82710826e+296,
          1.22703799e+002,   1.37097033e+002,  -2.56391552e+001,
          1.11457878e+002,   1.37097033e+002,  -2.56391552e+001,
          9.81898928e+001,   1.22703799e+002,  -2.45139066e+001,
          9.24341823e+001,   1.11457878e+002,  -1.90236954e+001]])

clf.predict_proba(X_to_predict)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

My issue is not NaN or inf values, since:

np.isnan(X_to_predict).sum()
Out[147]: 0

np.isinf(X_to_predict).sum()
Out[148]: 0

Question: How can I convert X_to_predict to values that are not too large for float32 while keeping as many digits after decimal point as possible?

If you inspect the dtype of your array X_to_predict, it should show float64.

# slightly modified array from the question
X_to_predict = np.array([1.37097033e+002, 0.00000000e+000, -1.82710826e+296,
                         1.22703799e+002, 1.37097033e+002, -2.56391552e+001,
                         1.11457878e+002, 1.37097033e+002, -2.56391552e+001,
                         9.81898928e+001, 1.22703799e+002, -2.45139066e+001]).reshape((3, 4))

print(X_to_predict.dtype)
>>> float64

sklearn's tree-based estimators (DecisionTreeClassifier, RandomForestClassifier, ...) silently convert the input array to float32; see the discussion here for the origin of the error message.
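You can see the effect of that silent cast with plain NumPy; a minimal sketch using the problematic value from the question:

```python
import numpy as np

# the third value from the question, stored as float64
x = np.array([-1.82710826e+296])

print(np.isinf(x))                     # [False]: representable in float64
print(np.isinf(x.astype(np.float32)))  # [ True]: overflows to -inf in float32
```

float64 can hold magnitudes up to about 1.8e+308, but float32 tops out at about 3.4e+38, so the cast turns the value into -inf and triggers the "infinity" branch of the error check.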

You can convert it yourself:

print(X_to_predict.astype(np.float32))

>>> array([[137.09703 ,   0.      ,       -inf, 122.7038  ],
           [137.09703 , -25.639154, 111.45788 , 137.09703 ],
           [-25.639154,  98.189896, 122.7038  , -24.513906]], 
          dtype=float32)

The third value (-1.82710826e+296) is larger in magnitude than float32 can represent (about 3.4e+38), so it becomes -inf in float32. The only way around this is to replace such values with the largest (or smallest) finite float32 value. You will lose some precision; as far as I know there is currently no parameter or workaround, short of changing the implementation in sklearn and recompiling it.

If you use np.nan_to_num your array should look like this:

new_X = np.nan_to_num(X_to_predict.astype(np.float32))
print(new_X)

>>> array([[ 1.3709703e+02,  0.0000000e+00, -3.4028235e+38,  1.2270380e+02],
           [ 1.3709703e+02, -2.5639154e+01,  1.1145788e+02,  1.3709703e+02],
           [-2.5639154e+01,  9.8189896e+01,  1.2270380e+02, -2.4513906e+01]],
          dtype=float32)

which should be accepted by your classifier.
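As an alternative to np.nan_to_num, you can clip the float64 array to the finite float32 range before casting, which maps the overflowing value to the most negative finite float32 instead of -inf; a minimal sketch:

```python
import numpy as np

f32 = np.finfo(np.float32)
X = np.array([[1.37097033e+002, -1.82710826e+296, 1.22703799e+002]])

# clip in float64 BEFORE casting, so values outside the float32 range
# become the finite float32 min/max rather than +/- inf
X32 = np.clip(X, f32.min, f32.max).astype(np.float32)
print(np.isfinite(X32).all())
```

This produces the same result as nan_to_num for the inf case, but it also avoids creating the inf in the first place.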


Complete code

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

iris = load_iris()
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42)
clf.fit(iris.data, iris.target)

X_to_predict = np.array([1.37097033e+002, 0.00000000e+000, -1.82710826e+296,
                         1.22703799e+002, 1.37097033e+002, -2.56391552e+001,
                         1.11457878e+002, 1.37097033e+002, -2.56391552e+001,
                         9.81898928e+001, 1.22703799e+002, -2.45139066e+001]).reshape((3, 4))

print(X_to_predict.dtype)

print(X_to_predict.astype(np.float32))

new_X = np.nan_to_num(X_to_predict.astype(np.float32))

print(new_X)

#should return array([2, 2, 0])
print(clf.predict(new_X))



# should crash
clf.predict(X_to_predict)

This error can be quite misleading at times. If the dataset has blank values (i.e. certain features contain empty entries), you can get the same error. Here is how to resolve it.

Convert the dataframe and export it to CSV. In the code below, df is the dataframe:

compression_opts = dict(method='zip', archive_name='out.csv')
df.to_csv('out.zip', index=False, compression=compression_opts)

You can also try this:

df[df['column_name'] == ''].index

Identify the features that have blank values by analyzing the exported CSV.

Remove the complete records that have blank values with:

df = df.dropna(subset=['column_name'])
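Putting those steps together, a small sketch of the blank-value cleanup (the dataframe and column names here are invented for illustration):

```python
import numpy as np
import pandas as pd

# toy dataframe with one blank entry in the "label" column
df = pd.DataFrame({"feature": [1.0, 2.0, 3.0],
                   "label":   ["x", "", "y"]})

# treat empty strings as missing values, then drop the affected rows
df = df.replace("", np.nan).dropna(subset=["label"])
print(len(df))  # 2
```

Replacing empty strings with NaN first is important, because dropna only removes genuine missing values, not empty strings.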
