Sklearn ValueError: Input contains NaN, infinity or a value too large for dtype('float32')

Question

I fit the following pipeline classifier:

Pipeline(memory=None,steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
                   ('kbest', SelectKBest(k=1218,score_func=<function mutual_info_classif at 0x7fec1e4991f0>)),
                   ('classifier',RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                           class_weight='balanced_subsample',
                                           criterion='gini', max_depth=15,
                                           max_features='log2',
                                           max_leaf_nodes=5, max_samples=0.6,
                                           min_impurity_decrease=0.0,
                                           min_impurity_split=None,
                                           min_samples_leaf=2,
                                           min_samples_split=15,
                                           min_weight_fraction_leaf=0.0,
                                           n_estimators=50, n_jobs=None,
                                           oob_score=True, random_state=42,
                                           verbose=0, warm_start=False))],verbose=False)

It fits fine, but when I call predict on my test data I get a ValueError:

*** ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I checked for infinite values and NaNs. The function raising the error is _assert_all_finite, located in sklearn/utils/validation.py. I imported the function directly and ran it on the X_test array and got no errors:

from sklearn.utils import validation
validation._assert_all_finite(X_test)

How can I get an error with the exact same data when I run the predict method on the classifier? The data clearly doesn't contain any NaNs or infs, or the function would raise an error when I call it directly. Somewhere along the predict method those values are created, but I don't know when, where, or why. Any help would be much appreciated!

Here's the full error message:

Traceback (most recent call last):
  File "testz.py", line 159, in <module>
    testing(dx_type, population, dx_option, feat_sel_metric, data_types, ratio_name, model_selection_metric, repo_path)
  File "testz.py", line 107, in testing
    y_test_pred=top_clf.predict(X_test)
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/utils/metaestimators.py", line 116, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/pipeline.py", line 420, in predict
    return self.steps[-1][-1].predict(Xt, **predict_params)
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 612, in predict
    proba = self.predict_proba(X)
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 656, in predict_proba
    X = self._validate_X_predict(X)
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 412, in _validate_X_predict
    return self.estimators_[0]._validate_X_predict(X, check_input=True)
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/tree/_classes.py", line 380, in _validate_X_predict
    X = check_array(X, dtype=DTYPE, accept_sparse="csr")
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 577, in check_array
    _assert_all_finite(array,
  File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sklearn/utils/validation.py", line 57, in _assert_all_finite
    raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Answer 1

It is impossible to know how you want to handle this; the problem is laid out plainly in the error: ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
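To illustrate the "value too large for dtype('float32')" part of that message, here is a minimal sketch with made-up data. It assumes the input is stored as float64 and, as the traceback's check_array(X, dtype=DTYPE, ...) call suggests, gets cast to float32 before the finiteness check. A value can be finite as float64 and still overflow to inf after that cast, which would explain why a check on the original array passes.

```python
import numpy as np

# Hypothetical data: every entry is finite as float64,
# but 1e40 is larger than the biggest finite float32 value.
X = np.array([[0.5, 1e40],
              [1.5, 2.0]], dtype=np.float64)

# Check on the original float64 array: nothing is NaN or inf here.
finite_as_float64 = bool(np.isfinite(X).all())

# Largest finite float32 value (roughly 3.4e38).
float32_max = np.finfo(np.float32).max

# Cast to float32; newer NumPy versions warn on overflow, so silence that.
with np.errstate(over="ignore"):
    X32 = X.astype(np.float32)

# Same check after the cast: the oversized entry has become inf.
finite_as_float32 = bool(np.isfinite(X32).all())
bad_positions = np.argwhere(~np.isfinite(X32))

print(finite_as_float64, finite_as_float32)
print(bad_positions)
```

If this is what is happening with your data, the check that matters is the one on the float32 version of the array, not the original.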

What we can do is make some assumptions about how you want this handled. We don't see how or where you create X_test, but I assume it comes from train_test_split and that it is a pandas DataFrame, given the traceback.

So, you could do the following:

# Assumes X_train and X_test are pandas DataFrames
import numpy as np
import pandas as pd

# First, replace all infinity values with NaN
X_train.replace([np.inf, -np.inf], np.nan, inplace=True)

# Then, replace NaN values with whatever you like. This example uses 0
X_train.fillna(0, inplace=True)

# You'll probably want to repeat the same for X_test

# First, replace all infinity values with NaN
X_test.replace([np.inf, -np.inf], np.nan, inplace=True)

# Then, replace NaN values with whatever you like. This example uses 0
X_test.fillna(0, inplace=True)
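
Before filling anything in, it may help to see which columns are actually affected. The following is a rough sketch, not part of scikit-learn: non_finite_columns is a hypothetical helper name, and X_demo is made-up data standing in for X_test. It casts to float32 first, since that is the dtype named in the error, and then reports the columns that contain NaN or inf.

```python
import numpy as np

def non_finite_columns(X, dtype=np.float32):
    """Return indices of columns that hold NaN or inf once X is cast to `dtype`.

    Hypothetical helper for diagnosis; accepts a NumPy array or a numeric
    pandas DataFrame (anything np.asarray can convert to float64).
    """
    with np.errstate(over="ignore"):  # newer NumPy warns when a cast overflows
        arr = np.asarray(X, dtype=np.float64).astype(dtype)
    # A column is "bad" if any of its entries is not finite after the cast.
    return np.flatnonzero(~np.isfinite(arr).all(axis=0))

# Made-up stand-in for X_test: column 1 has a NaN,
# column 2 has a value that is finite in float64 but too large for float32.
X_demo = np.array([[1.0, np.nan, 3.0],
                   [4.0, 5.0, 1e39],
                   [7.0, 8.0, 9.0]])

bad_as_float32 = non_finite_columns(X_demo)
bad_as_float64 = non_finite_columns(X_demo, dtype=np.float64)
print(bad_as_float32, bad_as_float64)
```

Note that the classifier in the question does not see X_test directly but the output of the scaler and kbest steps, so the same check may be worth running on the transformed array as well, for example on top_clf[:-1].transform(X_test), assuming a scikit-learn version that supports slicing a Pipeline.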

Answer 1
0 2020-06-22 18:54:13