
Random Forest Regressor Feature Importance all zero

I'm running a random forest regressor using scikit-learn, but all the predictions end up being the same. I noticed that when I fit the data, all the feature importances are zero, which is probably why the predictions are identical. This is the code that I'm using:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

merged_df = pd.read_csv("/home/jovyan/efs/vliu/combined_data.csv")

target = merged_df["400kmDensity"]
merged_df.drop("400kmDensity", axis = 1, inplace = True)

features_list = list(merged_df.columns)

#Set training and testing groups
train_features, test_features, train_target, test_target = train_test_split(merged_df, target, random_state = 16)

#Train model
rf = RandomForestRegressor(n_estimators = 150, random_state = 16)
ran = rf.fit(train_features, train_target)

print("Feature importances: ", rf.feature_importances_)
#Make predictions and calculate error
predictions = ran.predict(test_features)
print("Predictions: ", predictions)

Here's a link to the data file: https://drive.google.com/file/d/1ECgKAH82wxIvt2OCv4W5ir1te_Vr3N_r/view?usp=sharing

If anybody can see what I did wrong before fitting the data that would result in the feature importances all being zero, that would be much appreciated.

Your variables "400kmDensity" and "410kmDensity" have a correlation coefficient of >0.99:

import numpy as np

# Correlation matrix of the two density columns
np.corrcoef(merged_df["400kmDensity"], merged_df["410kmDensity"])

In practice, this means that you can predict "400kmDensity" almost exclusively from "410kmDensity". On a scatter plot they form an almost perfect line:

[Scatter plot of "400kmDensity" against "410kmDensity"]
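For reference, here is a minimal sketch to reproduce such a scatter plot, assuming merged_df is the DataFrame as loaded (i.e. before the target column is dropped):

import matplotlib.pyplot as plt

# Plot the two density columns against each other; a near-perfect
# line is what a correlation of >0.99 looks like.
plt.scatter(merged_df["400kmDensity"], merged_df["410kmDensity"], s=2)
plt.xlabel("400kmDensity")
plt.ylabel("410kmDensity")
plt.show()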

In order to actually explore what affects the values of "400kmDensity", you should exclude "410kmDensity" as a regressor (an explanatory variable). The feature importances can then help to identify the relevant explanatory variables. Note, however, that feature importance may not be a perfect metric for determining which features actually matter; you may want to look into other available methods such as the Boruta algorithm, permutation importance, ... (a sketch of the latter follows below).
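Here is a minimal sketch of permutation importance with scikit-learn, assuming rf has been fitted on the train/test split from the question:

from sklearn.inspection import permutation_importance

# Permutation importance: how much the test score drops when a
# single feature's values are randomly shuffled.
result = permutation_importance(rf, test_features, test_target,
                                n_repeats=10, random_state=16)
for name, imp in zip(features_list, result.importances_mean):
    print(name, imp)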

Regarding the initial question: I'm not really sure why, but RandomForestRegressor seems to have a problem with your very small target values(?). I was able to get non-zero feature importances after scaling train_target and train_features before calling rf.fit(). However, scaling should not actually be necessary at all for a random forest! You may want to take a look at the respective documentation or extend your search in this direction. I hope this serves as a hint.

from sklearn.preprocessing import scale

# Standardize features and target before fitting (a workaround;
# scaling is normally not needed for random forests)
fitted_rf = rf.fit(scale(train_features), scale(train_target))

As mentioned before, the feature importances after this change unsurprisingly look like this:

[Image: feature importances after scaling]
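To make that output easier to read, you can pair each importance with its column name (a small sketch reusing features_list from the question):

# Print each feature next to its importance, sorted in descending order
for name, imp in sorted(zip(features_list, fitted_rf.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print(f"{name}: {imp:.4f}")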

Also, the column "second" holds only the value zero, so it cannot explain anything. Your first step should always be EDA (Exploratory Data Analysis) to get a feeling for the data, like checking correlations between columns or generating histograms to explore the data distributions [...]. A small sketch follows below.
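A few pandas one-liners go a long way here (a sketch, again assuming merged_df is the DataFrame as loaded):

import matplotlib.pyplot as plt

# Pairwise correlations between all columns
print(merged_df.corr())

# Histograms to inspect each column's distribution
merged_df.hist(figsize=(12, 8))
plt.show()

# Number of distinct values per column; this immediately exposes
# constant columns like "second"
print(merged_df.nunique())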

There is much more to it, but I hope this gives you a leg-up!
