Random Forest Classifier ValueError: Input contains NaN, infinity or a value too large for dtype('float32')
Python and random forest: I always get "ValueError: Input contains NaN, infinity or a value too large for dtype('float32')" no matter what I do.
I have read all the previous related questions and answers: this error occurs when a dataset contains missing or infinite values while using a random forest. I tried every suggested fix and none of them worked. Here is my code:
merge_df.head()
Out[21]:
high low close ... month day hour
timestamp ...
2020-11-13 17:00:00 0.004434 -0.005691 0.004348 ... 11 13 17
2020-11-13 18:00:00 0.002759 -0.002144 0.002122 ... 11 13 18
2020-11-13 19:00:00 0.005888 -0.001588 0.002965 ... 11 13 19
2020-11-13 20:00:00 0.000000 -0.008531 -0.008235 ... 11 13 20
2020-11-13 21:00:00 0.005195 -0.000362 0.004067 ... 11 13 21
[5 rows x 50 columns]
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Prepare training/test DataFrames.
train_end = pd.to_datetime('2021/04/10 12:00:00')
test_start = pd.to_datetime('2021/04/10 13:00:00')
target = 'next'
train_df = merge_df.loc[:train_end]
test_df = merge_df.loc[test_start:]
X_train = train_df.copy().drop(target, axis=1).values
X_test = test_df.copy().drop(target, axis=1).values
y_train = train_df[target].values
y_test = test_df[target].values
X_train[:] = np.nan_to_num(X_train)
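One thing worth checking here: np.nan_to_num alone may not be enough, because it maps NaN to 0 but maps ±inf to the largest finite float64 value, which still overflows when scikit-learn casts the input to float32. A minimal sketch, with a hypothetical toy array standing in for X_train:

```python
import numpy as np

# Toy array standing in for X_train; the real data comes from merge_df.
X = np.array([[1.0, np.nan], [np.inf, 2.0]])

# scikit-learn casts inputs to float32 and rejects NaN/inf, so check both.
print(np.isfinite(X).all())        # False: NaN and inf are present
X_clean = np.nan_to_num(X)         # NaN -> 0.0, inf -> largest finite float64
print(np.isfinite(X_clean).all())  # True, but the values may still be huge
print((np.abs(X_clean) > np.finfo(np.float32).max).any())  # True: too large for float32
```

So an array can pass np.isfinite after nan_to_num and still trigger the "too large for dtype('float32')" branch of the error.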
# Perform grid search for hyperparameters.
def Grid_Search_CV_RFR(X_train, y_train):
    reg = RandomForestRegressor()
    param_grid = {
        "n_estimators": [10, 50, 100, 500],
        "max_features": ["auto", "sqrt", "log2"],
        "min_samples_leaf": [1, 5, 10, 20],
    }
    tss_splits = TimeSeriesSplit(n_splits=10).split(X_train)
    grid = GridSearchCV(reg, param_grid, cv=tss_splits, verbose=0)
    #grid = GridSearchCV(reg, param_grid, cv=3, verbose=0)
    grid.fit(X_train, y_train)
    return grid.best_score_, grid.best_params_
best_score, best_params = Grid_Search_CV_RFR(X_train, y_train)
mf = best_params['max_features']
msl = best_params['min_samples_leaf']
ne = best_params['n_estimators']
# Fit RFR with best parameters from grid search.
rfr = RandomForestRegressor(n_estimators=ne, max_features=mf, min_samples_leaf=msl, random_state=10)
rfr.fit(X_train, y_train)
When I run the function to search for the best parameters, it raises:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
Any suggestions?
As the exception suggests, part of your dataset appears to contain NaN or infinite values. I notice that you apply np.nan_to_num only to the training set X_train; you probably need to add X_test[:] = np.nan_to_num(X_test) before running the grid search as well.
You can also check for NaNs by running train_df.isnull().sum(), which counts the null values in each column and helps you debug the data. You should probably also drop (or otherwise handle) any columns of type datetime.
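Those per-column checks can be sketched like this; the DataFrame and column names below are made up for illustration and stand in for train_df:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the real one comes from merge_df.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.inf, 2.0, 4.0],
    'c': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03']),
})

print(df.isnull().sum())            # NaN count per column
num = df.select_dtypes(include=[np.number])
print(np.isinf(num).sum())          # infinity count per numeric column
# datetime columns that scikit-learn cannot consume directly:
print(df.select_dtypes(include=['datetime64']).columns.tolist())
```

Checking infinities separately matters because isnull() does not count np.inf as missing.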
Found the solution: I had no missing values, but I did have infinite values, and this fixed the problem:
train_df[train_df == np.inf] = np.nan
train_df.fillna(train_df.mean(), inplace=True)
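A sketch of that fix on toy data. As an extension not in the original post, it also handles -np.inf and fills the test split with the training means, so both splits use the same statistics:

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train_df/test_df from the question.
train = pd.DataFrame({'x': [1.0, np.inf, 3.0]})
test = pd.DataFrame({'x': [np.inf, 2.0]})

# Turn infinities into NaN, then fill with the *training* mean so the
# test set does not leak its own statistics into imputation.
train = train.replace([np.inf, -np.inf], np.nan)
train_means = train.mean()
train = train.fillna(train_means)
test = test.replace([np.inf, -np.inf], np.nan).fillna(train_means)

print(train['x'].tolist())  # [1.0, 2.0, 3.0]
print(test['x'].tolist())   # [2.0, 2.0]
```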