
Python and random forest: always getting "ValueError: Input contains NaN, infinity or a value too large for dtype('float32')" no matter what I do

I have read all the previous related questions and answers here about this error, which occurs when a random forest is fitted on a dataset with missing or infinite values. I tried every suggestion, but none of them worked. This is my code:

merge_df.head()
Out[21]: 
                         high       low     close  ...  month  day  hour
timestamp                                          ...                  
2020-11-13 17:00:00  0.004434 -0.005691  0.004348  ...     11   13    17
2020-11-13 18:00:00  0.002759 -0.002144  0.002122  ...     11   13    18
2020-11-13 19:00:00  0.005888 -0.001588  0.002965  ...     11   13    19
2020-11-13 20:00:00  0.000000 -0.008531 -0.008235  ...     11   13    20
2020-11-13 21:00:00  0.005195 -0.000362  0.004067  ...     11   13    21

[5 rows x 50 columns]

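import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Prepare training/test DataFrames.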
train_end = pd.to_datetime('2021/04/10 12:00:00')
test_start = pd.to_datetime('2021/04/10 13:00:00')
target = 'next'

train_df = merge_df.loc[:train_end]
test_df = merge_df.loc[test_start:] 

X_train = train_df.copy().drop(target, axis=1).values
X_test = test_df.copy().drop(target, axis=1).values
y_train = train_df[target].values
y_test = test_df[target].values
X_train[:] = np.nan_to_num(X_train)

# Perform grid search for hyperparameters. 
def Grid_Search_CV_RFR(X_train, y_train):
    reg = RandomForestRegressor()
    param_grid = { 
            "n_estimators"      : [10,50,100,500],
            "max_features"      : ["auto", "sqrt", "log2"],
            "min_samples_leaf" : [1,5,10,20]
            }

    tss_splits = TimeSeriesSplit(n_splits=10).split(X_train)
    grid = GridSearchCV(reg, param_grid, cv=tss_splits, verbose=0)
    #grid = GridSearchCV(reg, param_grid, cv=3, verbose=0)

    grid.fit(X_train, y_train)

    return grid.best_score_ , grid.best_params_

best_score, best_params = Grid_Search_CV_RFR(X_train, y_train)

mf = best_params['max_features']
msl = best_params['min_samples_leaf']
ne = best_params['n_estimators']

# Fit RFR with best parameters from grid search.
rfr = RandomForestRegressor(n_estimators=ne, max_features=mf, min_samples_leaf=msl, random_state=10)
rfr.fit(X_train, y_train)

When I run the function for searching the best parameters, it gives:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Any suggestions?

It seems that some of your datasets contain, as the exception suggests, NaN or infinite values. I noticed that you call np.nan_to_num only on your training set X_train. You may want to add X_test[:] = np.nan_to_num(X_test) before running the grid search.
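One possible reason np.nan_to_num alone does not clear the error: by default it replaces infinities with the largest finite float64 value, and that value overflows back to infinity once it is cast to float32 (the dtype named in the error message). The sketch below uses a made-up array x, not the question's data, and the choice of 0.0 as the replacement is only an assumption for illustration.

```python
import numpy as np

# Hypothetical sample containing NaN and both infinities.
x = np.array([1.0, np.nan, np.inf, -np.inf])

# Default behaviour: NaN -> 0.0, +/-inf -> largest/smallest finite float64.
default_clean = np.nan_to_num(x)

# Casting those huge float64 values to float32 overflows to +/-inf again.
with np.errstate(over="ignore"):
    as_float32 = default_clean.astype(np.float32)

# Passing explicit replacements keeps the values representable in float32.
explicit_clean = np.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0)
all_finite_after_cast = bool(np.isfinite(explicit_clean.astype(np.float32)).all())
```

If this is what happens in your data, the array looks finite in float64 but is not finite after the float32 conversion.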

You can also check for NaNs by running train_df.isnull().sum(), which counts the null values in each column and helps you debug your data. You may also need to drop (or properly process) any column of type datetime.
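Note that isnull() does not flag infinite values, so a column can report zero nulls and still trigger the error. A small sketch of a per-column check that counts NaN and both infinities separately; the helper name count_non_finite and the demo frame are hypothetical, not part of the question's code.

```python
import numpy as np
import pandas as pd

def count_non_finite(df):
    """Return per-column counts of NaN, +inf and -inf for numeric columns."""
    numeric = df.select_dtypes(include=[np.number])
    return pd.DataFrame({
        "nan": numeric.isna().sum(),
        "pos_inf": (numeric == np.inf).sum(),
        "neg_inf": (numeric == -np.inf).sum(),
    })

# Hypothetical data: isnull() alone would miss the infinities here.
demo = pd.DataFrame({"high": [0.1, np.inf, 0.3], "low": [np.nan, -np.inf, 0.2]})
report = count_non_finite(demo)
```

Calling count_non_finite(train_df) on the real data should show which columns need attention.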

I found the solution: I had no missing values, only infinite ones, and solved it with this:

train_df[train_df==np.inf]=np.nan

train_df.fillna(train_df.mean(), inplace=True)
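The snippet above only catches positive infinity and modifies a slice of merge_df in place. A slightly more defensive variant might look like the sketch below. It also handles negative infinity, returns new frames, and fills the test set with the training means. Reusing the training means on the test set is my assumption rather than something stated in the answer, and the helper name clean_with_train_means is hypothetical.

```python
import numpy as np
import pandas as pd

def clean_with_train_means(train_df, test_df):
    """Replace +/-inf with NaN, then fill NaN using the training column means."""
    train_clean = train_df.replace([np.inf, -np.inf], np.nan)
    test_clean = test_df.replace([np.inf, -np.inf], np.nan)
    # mean() skips NaN by default, so the former infinities do not distort it.
    train_means = train_clean.mean()
    return train_clean.fillna(train_means), test_clean.fillna(train_means)

# Hypothetical demo frames standing in for train_df / test_df.
train_demo = pd.DataFrame({"high": [1.0, np.inf, 3.0]})
test_demo = pd.DataFrame({"high": [-np.inf, 5.0]})
train_out, test_out = clean_with_train_means(train_demo, test_demo)
```

A column that is entirely non-finite in the training set would still come out as NaN, so it is worth re-checking with np.isfinite before fitting.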

