
Python and random forest: always get the error "ValueError: Input contains NaN, infinity or a value too large for dtype('float32')" no matter what I do

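I have read all the previous related questions and answers. This error occurs with random forests when the dataset contains missing or infinite values. I have tried every suggestion and none of them worked. Here is my code: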

merge_df.head()
Out[21]: 
                         high       low     close  ...  month  day  hour
timestamp                                          ...                  
2020-11-13 17:00:00  0.004434 -0.005691  0.004348  ...     11   13    17
2020-11-13 18:00:00  0.002759 -0.002144  0.002122  ...     11   13    18
2020-11-13 19:00:00  0.005888 -0.001588  0.002965  ...     11   13    19
2020-11-13 20:00:00  0.000000 -0.008531 -0.008235  ...     11   13    20
2020-11-13 21:00:00  0.005195 -0.000362  0.004067  ...     11   13    21

[5 rows x 50 columns]

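import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Prepare training/test DataFrames.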
train_end = pd.to_datetime('2021/04/10 12:00:00')
test_start = pd.to_datetime('2021/04/10 13:00:00')
target = 'next'

train_df = merge_df.loc[:train_end]
test_df = merge_df.loc[test_start:] 

X_train = train_df.copy().drop(target, axis=1).values
X_test = test_df.copy().drop(target, axis=1).values
y_train = train_df[target].values
y_test = test_df[target].values
X_train[:] = np.nan_to_num(X_train)

# Perform grid search for hyperparameters. 
def Grid_Search_CV_RFR(X_train, y_train):
    reg = RandomForestRegressor()
    param_grid = { 
            "n_estimators"      : [10,50,100,500],
            "max_features"      : ["auto", "sqrt", "log2"],
            "min_samples_leaf" : [1,5,10,20]
            }

    tss_splits = TimeSeriesSplit(n_splits=10).split(X_train)
    grid = GridSearchCV(reg, param_grid, cv=tss_splits, verbose=0)
    #grid = GridSearchCV(reg, param_grid, cv=3, verbose=0)

    grid.fit(X_train, y_train)

    return grid.best_score_ , grid.best_params_

best_score, best_params = Grid_Search_CV_RFR(X_train, y_train)

mf = best_params['max_features']
msl = best_params['min_samples_leaf']
ne = best_params['n_estimators']

# Fit RFR with best parameters from grid search.
rfr = RandomForestRegressor(n_estimators=ne, max_features=mf, min_samples_leaf=msl, random_state=10)
rfr.fit(X_train, y_train)

When I run the function that searches for the best parameters, it raises:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Any suggestions?

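As the exception suggests, some of your data appears to contain NaN or infinite values. I notice that you apply np.nan_to_num only to the training set X_train. You may need to add X_test[:] = np.nan_to_num(X_test) before running the grid search.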
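One possible reason the np.nan_to_num call in the question does not help: with its default arguments, NumPy replaces infinities with the largest finite float64 value, which is still too large for float32. The sketch below illustrates this on made-up data. The replacement values passed in the second call are arbitrary placeholders, not a recommendation.

```python
import numpy as np

# Toy feature matrix with one infinite and one missing entry (made-up data).
X = np.array([[0.5, np.inf], [np.nan, 1.0]])

# Defaults: NaN -> 0.0, +inf -> largest finite float64, -inf -> most negative float64.
cleaned = np.nan_to_num(X)

f32_max = np.finfo(np.float32).max
too_large = np.abs(cleaned) > f32_max  # True where a value cannot fit in float32

# Passing explicit replacements (placeholder values here) keeps the data in range.
safe = np.nan_to_num(X, nan=0.0, posinf=0.0, neginf=0.0)

print(too_large)
print(safe)
```

Whether zero is a sensible replacement depends on the feature; the point is only that the default replacement for infinity does not fit in float32.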

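You can also check for NaNs by running train_df.isnull().sum(), which counts the null values in each column and helps you debug your data. You should probably also drop (or otherwise process) any columns of type datetime.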
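Note that isnull() does not count infinite values by default, so a fuller check may be useful. The following is a minimal sketch; the helper name summarize_non_finite and the demo DataFrame are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd

def summarize_non_finite(df):
    """Count NaN, +inf, -inf and float32-overflowing values per numeric column."""
    numeric = df.select_dtypes(include=[np.number]).astype("float64")
    f32_max = np.finfo(np.float32).max
    report = pd.DataFrame({
        "nan": numeric.isna().sum(),
        "pos_inf": (numeric == np.inf).sum(),
        "neg_inf": (numeric == -np.inf).sum(),
        "too_large": (np.isfinite(numeric) & (numeric.abs() > f32_max)).sum(),
    })
    # Keep only the columns that contain at least one problem value.
    return report[report.sum(axis=1) > 0]

# Made-up data for illustration only.
demo = pd.DataFrame({
    "close": [0.004, np.inf, -0.008],
    "volume": [1.0, np.nan, 1e39],
    "hour": [17, 18, 19],
})
report = summarize_non_finite(demo)
print(report)
```

Calling summarize_non_finite(merge_df) before the train/test split would show which columns need attention.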

Found the solution. I had no missing values, but I did have infinite values, and this fixed the problem:

train_df[train_df==np.inf]=np.nan

train_df.fillna(train_df.mean(), inplace=True)
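The two lines above handle positive infinity in the training frame only. A slightly more general sketch, under the assumption that negative infinity and the test set should be treated the same way, might look like this. The function name clean_features and the sample frames are hypothetical.

```python
import numpy as np
import pandas as pd

def clean_features(train_df, test_df):
    """Replace +/-inf with NaN, then fill NaN using training-set column means."""
    train_clean = train_df.replace([np.inf, -np.inf], np.nan)
    test_clean = test_df.replace([np.inf, -np.inf], np.nan)
    # Means come from the training rows only, so no test information leaks in.
    train_means = train_clean.mean()
    return train_clean.fillna(train_means), test_clean.fillna(train_means)

# Made-up data for illustration only.
train = pd.DataFrame({"close": [1.0, np.inf, 3.0], "low": [-1.0, -np.inf, -3.0]})
test = pd.DataFrame({"close": [np.nan, 2.0], "low": [-np.inf, -2.0]})
train_clean, test_clean = clean_features(train, test)
print(train_clean)
print(test_clean)
```

Because replace and fillna return new frames here, this also avoids assigning into train_df, which is a slice of merge_df in the question's code.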

