I'm currently working with a dataset that has 5 columns of numeric variables and 23 columns of categorical variables. These variables are mostly nominal (not ordinal) and can contain anywhere from 4 to 15 different categories. I'm aware of OneHotEncoder, but I'm worried that applying something like RFECV would result in individual categories within a given variable being removed from the analysis, rather than entire variables being removed. Thanks!
Here is a function that implements a tree-based method for feature-importance analysis. It returns the original dataframe restricted to the top n features, in order of importance.
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.ensemble import ExtraTreesClassifier

    def select_best_Tree_features(df, target_var, top_n):
        """
        :param df: pandas dataframe
        :param target_var: string containing the target value column name
        :param top_n: integer indicating the number of columns to keep
        :return: dataframe restricted to the top_n features, the target series,
                 and the list of selected column names
        """
        Y = df[target_var]
        X = df.drop([target_var], axis=1)
        model = ExtraTreesClassifier()
        model.fit(X, Y)
        # Importance score per column, indexed by column name
        f = pd.Series(model.feature_importances_, index=X.columns)
        f.nlargest(top_n).plot(kind='barh')
        plt.show()
        print('\nFeature Scores\n', f.sort_values(ascending=False))
        top_list = f.nlargest(top_n).index.tolist()
        X_fi = df[top_list]
        return X_fi, Y, top_list
Make sure every column of the dataframe contains numeric values, or use a LabelEncoder to convert categorical columns first. Since label encoding keeps a single column per variable, the importance scores are reported per original variable rather than per category, which avoids the problem you describe with one-hot encoding.
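As a minimal sketch of that preprocessing step (the dataframe and column names here are hypothetical, just for illustration), you can fit one LabelEncoder per non-numeric column before calling the function:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical dataframe: two nominal columns, one numeric, plus a target
df = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green'],
    'size': ['S', 'M', 'L', 'M'],
    'price': [1.0, 2.5, 1.5, 2.0],
    'target': [0, 1, 0, 1],
})

# Encode each non-numeric column in place, keeping one encoder per column
# so the integer codes can be inverted later if needed
encoders = {}
for col in df.select_dtypes(exclude='number').columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df.dtypes)  # all columns are now numeric
```

After this step every variable is still a single column, so the dataframe can be passed straight to `select_best_Tree_features`.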