
Feature selection using scikit-learn on categorical features

I'm currently working with a dataset that has 5 columns of numeric variables and 23 columns of categorical variables. These variables are mostly nominal (not ordinal) and contain anywhere from 4 to 15 categories. I'm aware of OneHotEncoder, but I'm worried that applying something like RFECV would remove individual categories within a given variable rather than removing entire variables. Thanks!
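To make the concern concrete, here is a minimal sketch (toy data and hypothetical column names) showing that RFECV's support mask is computed per dummy column, so it can keep some categories of a one-hot-encoded variable while dropping others:

import numpy as np
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): two nominal variables and a binary target
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'color': rng.choice(['red', 'green', 'blue', 'yellow'], size=200),
    'size': rng.choice(['S', 'M', 'L'], size=200),
})
y = rng.integers(0, 2, size=200)

# One-hot encoding expands the 2 variables into 7 dummy columns
X = pd.get_dummies(df)

selector = RFECV(LogisticRegression(max_iter=1000), cv=5)
selector.fit(X, y)

# The mask is per dummy column: 'color_green' can be dropped while
# 'color_red' is kept, so a variable's categories are removed piecemeal
print(dict(zip(X.columns, selector.support_)))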

Here is a function that implements a tree-based method for feature-importance analysis. It returns the original dataframe restricted to the top n features by importance, together with the target column and the list of selected column names.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

def select_best_Tree_features(df, target_var, top_n):
    """
    :param df: pandas dataframe whose columns are all numeric
    :param target_var: string containing the target value column name
    :param top_n: integer indicating the number of columns to keep
    :return: the dataframe restricted to the top_n features, the target
             column, and the list of selected column names
    """
    Y = df[target_var]
    X = df.drop([target_var], axis=1)
    # Fit an ensemble of randomized trees; impurity-based importances come free
    model = ExtraTreesClassifier()
    model.fit(X, Y)
    # One importance score per feature, indexed by column name
    f = pd.Series(model.feature_importances_, index=X.columns)
    f.nlargest(top_n).plot(kind='barh')
    plt.show()
    print('\nFeature Scores\n', f.sort_values(ascending=False))
    top_list = f.nlargest(top_n).index.tolist()
    X_fi = df[top_list]
    return X_fi, Y, top_list
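A quick usage sketch (toy numeric frame and a hypothetical 'target' column name):

import numpy as np
import pandas as pd

# Hypothetical toy frame: three numeric features plus a binary target
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 5, size=(100, 3)), columns=['a', 'b', 'c'])
df['target'] = rng.integers(0, 2, size=100)

X_fi, Y, top_list = select_best_Tree_features(df, 'target', top_n=2)
print(top_list)  # names of the two most important columns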

Make sure every column of the dataframe contains numeric values, or use a label encoder to convert the categorical columns first, as in the sketch below.
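For that conversion, here is a minimal sketch, assuming the categorical columns have object dtype (the helper name is hypothetical; scikit-learn's OrdinalEncoder would do the same job for a whole frame in one call):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categoricals(df):
    # Work on a copy; map each object column's categories to integer codes
    out = df.copy()
    for col in out.select_dtypes(include='object').columns:
        out[col] = LabelEncoder().fit_transform(out[col])
    return out

# df_numeric = encode_categoricals(df)  # then call select_best_Tree_features on it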
