
Feature selection by Pearson correlation or Feature importance in Random Forest

I am a little confused. In my dataset, one feature shows the least significant relationship with the target variable according to Pearson correlation. However, after assessing feature importance, the same feature shows the most significant relationship with the target, as shown in the image. In the image below, the variable called "diff" is the target and the variable called "hour" is the independent feature. Is it possible for one feature to show the least significant relationship based on Pearson correlation but the most significant one based on feature importance? If so, which one should be the reference for feature selection: Pearson correlation or feature importance?

[Image: Pearson Correlation vs Feature Importance]

I think this is possible. Pearson correlation only quantifies linear relationships. If the two variables are not linearly related, the correlation coefficient will be low, but that doesn't mean there is no relationship. It could be quadratic, cubic, or some other non-linear shape. A tree-based model can base many of its splits on such a non-linear relationship, which gives the feature a relatively high importance score.

I recommend plotting the feature against the target. A scatterplot can reveal non-linear relationships.

As for which score to use: feature importance is specific to your model. If you plan to stick with your tree-based model, use that. If you plan to use a linear model, the correlation can give you a decent idea of which features are useful; for a non-linear model it may be a poor guide.
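To illustrate the point, here is a minimal sketch on synthetic data, not the asker's dataset. It assumes scikit-learn and NumPy are installed. The names "hour" and "diff" are borrowed from the question, while `linear_feature`, `noise_feature`, and the quadratic shape of the relationship are made up for the example. The target depends on "hour" through a symmetric curve, so the Pearson coefficient should come out close to zero even though a random forest ranks "hour" as the most important feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000

# Hypothetical data: "hour" drives the target through a symmetric,
# non-linear (quadratic) curve centred on midday.
hour = rng.uniform(0, 24, n)
linear_feature = rng.normal(size=n)   # assumed feature with a linear effect
noise_feature = rng.normal(size=n)    # assumed feature unrelated to the target

diff = (hour - 12) ** 2 + 20 * linear_feature + rng.normal(scale=2.0, size=n)

X = np.column_stack([hour, linear_feature, noise_feature])
names = ["hour", "linear_feature", "noise_feature"]

# Pearson correlation of each feature with the target.
pearson = [np.corrcoef(X[:, j], diff)[0, 1] for j in range(X.shape[1])]

# Impurity-based feature importance from a random forest.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, diff)
importances = model.feature_importances_

for name, r, imp in zip(names, pearson, importances):
    print(f"{name:15s} pearson={r:+.3f}  importance={imp:.3f}")
```

In this setup the linear feature gets the clearly non-zero correlation, while "hour" gets the largest importance because most of the target's variance comes from the quadratic term. A scatterplot of `hour` against `diff` would show the U-shaped curve that the correlation coefficient misses.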

