Sklearn ValueError: Input contains NaN, infinity or a value too large for dtype('float32')

Question

I fit the following pipeline classifier:

Pipeline(memory=None,steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
                   ('kbest', SelectKBest(k=1218,score_func=<function mutual_info_classif at 0x7fec1e4991f0>)),
                   ('classifier',RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                           class_weight='balanced_subsample',
                                           criterion='gini', max_depth=15,
                                           max_features='log2',
                                           max_leaf_nodes=5, max_samples=0.6,
                                           min_impurity_decrease=0.0,
                                           min_impurity_split=None,
                                           min_samples_leaf=2,
                                           min_samples_split=15,
                                           min_weight_fraction_leaf=0.0,
                                           n_estimators=50, n_jobs=None,
                                           oob_score=True, random_state=42,
                                           verbose=0, warm_start=False))],verbose=False)

It fits fine, but when I use predict on my test data I get a ValueError:

*** ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I checked for infinite values and NaNs. The function raising the error is _assert_all_finite, located in sklearn/utils/validation.py. I imported the function directly, ran it on the X_test array, and got no errors:

from sklearn.utils import validation
validation._assert_all_finite(X_test)

How can I get an error with the exact same data when I run the predict method on the classifier? It clearly doesn't have any NaNs or Infs or it would raise an error when I directly import the function. Somewhere along the predict method, it creates those values, but I don't know when, where and why... Any help would be much appreciated!

Here's the full error message:

Traceback (most recent call last):
  File "testz.py", line 159, in <module>
    testing(dx_type, population, dx_option, feat_sel_metric, data_types, ratio_name, model_selection_metric, repo_path)
  File "testz.py", line 107, in testing
    y_test_pred=top_clf.predict(X_test)
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/utils/metaestimators.py", line 116, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/pipeline.py", line 420, in predict
    return self.steps[-1][-1].predict(Xt, **predict_params)
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 612, in predict
    proba = self.predict_proba(X)
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 656, in predict_proba
    X = self._validate_X_predict(X)
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 412, in _validate_X_predict
    return self.estimators_[0]._validate_X_predict(X, check_input=True)
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/tree/_classes.py", line 380, in _validate_X_predict
    X = check_array(X, dtype=DTYPE, accept_sparse="csr")
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 577, in check_array
    _assert_all_finite(array,
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 57, in _assert_all_finite
    raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Answer 1

It is impossible to know how you want to handle this, but the problem is stated plainly in the error message: ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

What we can do is make some assumptions about how you want this handled. We don't see how or where you create X_test, but I assume it comes from train_test_split and that it is a pandas DataFrame.

So, you could do the following:

import numpy as np
import pandas as pd

# First, replace all infinity values with NaN
X_train.replace([np.inf, -np.inf], np.nan, inplace=True)

# Then, replace NaN values with whatever you like. This example uses 0
X_train.fillna(0, inplace=True)

# You'll probably want to repeat the same for X_test

# First, replace all infinity values with NaN
X_test.replace([np.inf, -np.inf], np.nan, inplace=True)

# Then, replace NaN values with whatever you like. This example uses 0
X_test.fillna(0, inplace=True)
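
As for why the direct call to _assert_all_finite passes: the traceback shows the tree validating its input with check_array(X, dtype=DTYPE), and the error message names float32. So the data is cast to float32 before the finiteness check, and a value that is finite in float64 can overflow to infinity at that point. Below is a minimal sketch of how you might locate such columns. The helper name non_finite_columns and the demo array are hypothetical, made up for illustration, and it uses only NumPy:

```python
import numpy as np

def non_finite_columns(X, dtype=np.float32):
    """Return indices of columns that hold NaN or inf once X is cast to dtype."""
    with np.errstate(over="ignore"):  # silence the overflow warning from the cast
        X_cast = np.asarray(X, dtype=np.float64).astype(dtype)
    # A column is flagged if any of its entries is not finite after the cast
    return np.flatnonzero(~np.isfinite(X_cast).all(axis=0))

# Hypothetical data: 1e39 fits in float64 but exceeds the float32 range
X_demo = np.array([[0.5, 1e10], [0.2, 1e39]])

print(np.isfinite(X_demo).all())    # the float64 check passes
print(non_finite_columns(X_demo))   # the second column overflows in float32
```

In your case the array to inspect is the one that actually reaches the classifier, not the raw X_test. Assuming your scikit-learn version supports pipeline slicing, something like top_clf[:-1].transform(X_test) should give you the scaled and selected features to pass to the helper.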
