
SKlearn Random Forest error on input

I am trying to run the fit for my random forest, but I am getting the following error:

forest.fit(train[features], y)

returns

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-603415b5d9e6> in <module>()
----> 1 forest.fit(train[rubio_top_corr], y)

/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.pyc in fit(self, X, y, sample_weight)
    210         """
    211         # Validate or convert input data
--> 212         X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    213         if issparse(X):
    214             # Pre-sort indices to avoid that each individual tree of the

/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    396                              % (array.ndim, estimator_name))
    397         if force_all_finite:
--> 398             _assert_all_finite(array)
    399 
    400     shape_repr = _shape_repr(array.shape)

/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
     52             and not np.isfinite(X).all()):
     53         raise ValueError("Input contains NaN, infinity"
---> 54                          " or a value too large for %r." % X.dtype)
     55 
     56 

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I have coerced my dataframe from float64 to float32 for my features and made sure that there are no nulls, so I'm not sure what's throwing this error. Let me know if it would be helpful to put in more of my code.

UPDATE

It's originally a pandas dataframe, from which I dropped all NaNs. The original dataframe is survey results with respondent information, where I dropped all questions except for my DV. I double-checked this by running rforest_df.isnull().sum(), which returned 0. This is the full code I used for the modeling.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as RFC  # assuming RFC aliases sklearn's classifier

rforest_df = qfav3_only
rforest_df[features] = rforest_df[features].astype(np.float32)
# random ~75/25 train/test split
rforest_df['is_train'] = np.random.uniform(0, 1, len(rforest_df)) <= .75
train, test = rforest_df[rforest_df['is_train']==True], rforest_df[rforest_df['is_train']==False]
forest = RFC(n_jobs=2, n_estimators=50)
# encode the DV's labels as integers
y, _ = pd.factorize(train['K6_QFAV3'])
forest.fit(train[features], y)

Update

This is what the y data looks like:

array([ 0,  1,  2,  3,  4,  3,  3,  5,  6,  7,  8,  7,  9,  6, 10,  6, 11,
        7, 11,  3,  7,  9,  6,  5,  9, 11, 12, 13,  6, 11,  3,  3,  6, 14,
       15,  0,  9,  9,  2,  0, 11,  3,  9,  4,  9,  7,  3,  4,  9, 12,  9,
        7,  6, 13,  6,  0,  0, 16,  6, 11,  4, 10, 11, 11, 17,  3,  6, 16,
        3,  4, 18, 19,  7, 11,  5, 11,  5,  4,  0,  6, 17,  7,  2,  3,  5,
       11,  8,  9, 18,  6,  9,  8,  5, 16, 20,  0,  4,  8, 13, 16,  3, 20,
        0,  5,  4,  2, 11,  0,  3,  0,  6,  6,  6,  9,  4,  6,  5, 11,  0,
       13,  6,  2, 11,  7,  5,  6, 18, 12, 21, 17,  3,  6,  0, 13, 21,  7,
        3,  2, 18, 22,  7,  3,  2,  6,  7,  8,  4,  0,  7, 12,  3,  7,  3,
        2, 11, 19, 11,  6,  2,  9,  3,  7,  9,  9,  5,  6,  8,  0, 18, 11,
        3, 12,  2,  6,  4,  7,  7, 11,  3,  6,  6,  0,  6, 12, 15,  3,  9,
        3,  3,  0,  5,  9,  7,  9, 11,  7,  3, 20,  0,  7,  6,  6, 23, 15,
       19,  0,  3,  6, 16, 13,  5,  6,  6,  3,  6, 11,  9,  0,  6, 23, 16,
        4,  0,  6, 17, 11, 17, 11,  4,  3, 13,  3, 17, 16, 11,  7,  4, 24,
        5,  2,  7,  7,  8,  3,  3, 11,  8,  7, 23,  7,  7, 11,  7, 11,  6,
       15,  3, 25,  7,  4,  5,  3, 17, 20,  3, 26,  7,  9,  6,  6, 17, 20,
        1,  0, 11,  9, 16, 20,  7,  7, 26,  3,  6, 20,  7,  2, 11,  7, 27,
        9,  4, 26, 28,  8,  6,  9, 19,  7, 29,  3,  2, 26, 30,  6, 31,  6,
       18,  3,  0, 18,  4,  7, 32,  0,  2,  8,  0,  5,  9,  4, 16,  6, 23,
        0,  7,  0,  7,  9,  6,  8,  3,  7,  9,  3,  3, 12, 11,  8, 19, 20,
        7,  3,  5, 11,  3, 11,  8,  4,  4,  6,  9,  4,  1,  3,  0,  9,  9,
        6,  7,  8, 33,  8,  7,  9, 34, 11, 11,  6,  9,  9, 17,  8, 19,  0,
        7,  4, 17,  6,  7,  0,  4, 12,  7,  6,  4, 16, 12,  9,  6,  6,  6,
        6, 26, 13,  9,  7,  2,  7,  3, 11,  3,  6,  7, 19,  4,  8,  9, 13,
       11, 15, 11,  4, 18,  7,  7,  7,  0,  5,  4,  6,  0,  3,  7,  4, 25,
       18,  6, 19,  7,  9,  4, 20,  6,  3,  7,  4, 35, 15, 11,  2, 12,  0,
        7, 32,  6, 18,  9,  9,  6,  2,  3, 19, 36, 32,  0,  7,  0,  9, 37,
        3,  5,  6,  5, 34,  2,  6,  0,  7,  0,  7,  3,  7,  4, 18, 18,  7,
        3,  7, 16,  9, 19, 13,  4, 16, 19,  3, 19, 38,  9,  4,  9,  8,  0,
       17,  0,  2,  3,  5,  6,  5, 11, 11,  2,  9,  5, 33,  9,  5,  6, 20,
       13,  3, 39, 13,  7,  0,  9,  0,  4,  6,  7, 16,  7,  0, 21,  5,  3,
       18,  5, 20,  2,  2, 14,  6, 17, 11, 11, 16, 16,  9,  8, 11,  3, 23,
        0, 11,  0,  6,  0,  0,  3, 16,  6,  7,  5,  9,  7, 13,  0, 20,  0,
       25,  6, 16,  8,  4,  4,  2,  8,  7,  5, 40,  3,  8,  5, 12,  8,  9,
        6,  6,  6,  6,  3,  7, 26,  4,  0, 13,  4,  3, 13, 12,  7,  7,  6,
        7, 19, 15,  0, 33,  4,  5,  5, 20,  3, 11,  5,  4,  7,  9,  7, 11,
       36,  9,  0,  6,  6, 11,  6,  4,  2,  5, 18,  8,  5,  5,  2, 25,  4,
       41,  7,  7,  5,  7,  3, 36, 11,  6,  9,  0,  9,  0, 16, 42, 11, 11,
       18,  9,  5, 36,  2,  9,  6,  3, 43,  9, 17, 13,  5,  9,  3,  4,  6,
       44, 37,  0, 45,  2, 18,  8, 46,  2, 12,  9,  9,  3, 16,  6, 12,  9,
        0, 11, 11,  0, 25,  8, 17,  4,  4,  3, 11,  3, 11,  6,  6,  9,  7,
       23,  0,  2,  0,  3,  3,  4,  4,  9,  5, 11, 16,  7,  3, 18, 11,  7,
        6,  6,  6,  5,  9,  6,  3,  9,  7, 17, 11,  4,  9,  2,  3,  0, 26,
        9,  0, 20,  8,  9,  6, 11,  6,  6,  7, 26,  6,  6,  4, 19,  5, 41,
       19, 18, 29,  6,  5, 13,  6, 11,  7,  7,  6,  8,  5,  0,  3, 13, 17,
        6, 20, 11,  6,  9,  6,  2,  7, 11,  9, 20, 12,  7,  6,  8,  7,  4,
        6,  2,  0,  7,  9, 26,  9, 16,  7,  4, 45,  7,  0, 23,  8,  4, 19,
        4, 26, 11,  4,  4,  5,  7,  3,  0, 29, 12,  3,  4, 11,  4, 12,  8,
        7,  5,  0, 47, 12,  0, 25,  6, 16, 20,  5,  8,  4,  4, 11, 12,  0,
        6,  3, 11,  4,  3, 48,  3,  6,  7,  4,  7,  0,  3,  7,  3, 18,  6,
        2,  9,  9, 11,  3,  9,  6, 18, 16,  6, 34,  2,  7,  4,  3, 45,  5,
        0,  7,  2, 17, 17,  9, 18,  5,  6,  5, 15,  5,  7,  6,  9,  0,  7,
       12, 17])

I would first recommend that you check the datatype of each column within your train[features] df in the following way:

print train[features].dtypes 

If you see that there are columns that are non-numeric, you can inspect these columns to ensure there are not any unexpected values (e.g. strings, NaNs, etc.) that would cause problems. If you do not mind dropping the non-numeric columns, you can simply select all the numeric columns with the following:

numeric_cols = X.select_dtypes(include=['float64','float32']).columns

You can also add in columns with int dtypes if you would like, as in the sketch below.
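For example, a minimal sketch (assuming numpy is imported as np) that keeps every numeric column, ints included:

import numpy as np

# select all numeric columns at once, floats and ints alike
numeric_cols = X.select_dtypes(include=[np.number]).columns
X_numeric = X[numeric_cols]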

If you are encountering values that are too large or too small for the model to handle, it is a sign that scaling the data is a good idea. In sklearn, this can be accomplished as follows:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1), copy=True).fit(train[features])
train[features] = scaler.transform(train[features])

Finally, you should consider imputing missing values with sklearn's Imputer, as well as filling NaNs with something like the following:

# assign the result back; inplace=True on a column slice can silently modify a copy
train[features] = train[features].fillna(0)
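And a minimal Imputer sketch (assuming the old sklearn.preprocessing.Imputer API, which matches the Python 2.7 / sklearn 0.x traceback above) that replaces each NaN with its column mean:

from sklearn.preprocessing import Imputer

# learn each column's mean and substitute it for any NaN
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
train[features] = imputer.fit_transform(train[features])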

It happens when you have null strings (like '') in the dataset. Also try to print something like

train[features].apply(pd.value_counts)

or even

sorted(list(set(...)))

or get the min and max values for each column of the dataset in a loop, as sketched below.
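For instance, a minimal sketch of that loop (Python 2, to match the traceback above), which also flags infinities that isnull() cannot catch:

import numpy as np

# print each column's range and whether every value is finite;
# isnull() only detects NaN, while np.isfinite also catches +/-inf
for col in train[features].columns:
    values = train[features][col]
    print col, values.min(), values.max(), np.isfinite(values).all()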

The example above with MinMaxScaler may work, but scaled features do not work well in an RF.
