
How may I un-encode the features from a decision tree to see the important features?

I have a dataset that I am working with. I am converting its features from categorical to numerical for my decision tree. The conversion happens on the entire data frame with the following lines:

# LE is presumably scikit-learn's LabelEncoder, e.g.:
# from sklearn.preprocessing import LabelEncoder as LE
le = LE()
df = df.apply(le.fit_transform)

I later take this data and split it into training and testing data with the following:

# tts is presumably scikit-learn's train_test_split, e.g.:
# from sklearn.model_selection import train_test_split as tts
target = ['label']
df_y = df['label']
df_x = df.drop(target, axis=1)

# Split into training and testing data
train_x, test_x, train_y, test_y = tts(df_x, df_y, test_size=0.3, random_state=42)

Then I am passing it to a method to train a decision tree:

def Decision_Tree_Classifier(train_x, train_y, test_x, test_y, le):
    print " - Candidate: Decision Tree Classifier"
    dec_tree_classifier = DecisionTreeClassifier(random_state=0) # Load Module
    dec_tree_classifier.fit(train_x, train_y) # Fit
    accuracy = dec_tree_classifier.score(test_x, test_y) # Acc
    predicted = dec_tree_classifier.predict(test_x)
    mse = mean_squared_error(test_y, predicted)

    tree_feat = list(le.inverse_transform(dec_tree_classifier.tree_.feature))
    print "Tree Features:"
    print tree_feat
    print "Tree Thresholds:"
    print dec_tree_classifier.tree_.threshold

    scores = cross_val_score(dec_tree_classifier, test_x, test_y.values.ravel(), cv=10)
    return (accuracy, mse, scores.mean(), scores.std())

In the above method, I am passing the LabelEncoder object originally used to encode the dataframe. I have the line

tree_feat = list(le.inverse_transform(dec_tree_classifier.tree_.feature))

to try to convert the features back to their original categorical representation, but I keep getting this stack trace error:

  File "<ipython-input-6-c2005f8661bc>", line 1, in <module>
    runfile('main.py', wdir='/Users/mydir)

  File "/Users/me/anaconda2/lib/python2.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 668, in runfile
    execfile(filename, namespace)

  File "/Users/me/anaconda2/lib/python2.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 100, in execfile
    builtins.execfile(filename, *where)

  File "/Users/me/mydir/main.py", line 125, in <module>
    main()  # Run main routine

  File "candidates.py", line 175, in get_baseline
    dec_tre_acc = Decision_Tree_Classifier(train_x, train_y, test_x, test_y, le)

  File "candidates.py", line 40, in Decision_Tree_Classifier
    tree_feat = list(le.inverse_transform(dec_tree_classifier.tree_.feature))

  File "/Users/me/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 281, in inverse_transform
    "y contains previously unseen labels: %s" % str(diff))

ValueError: y contains previously unseen labels: [-2]

What do I need to change to be able to look at the actual features themselves?

When you do this:

df = df.apply(le.fit_transform)

you are using a single LabelEncoder instance for all of your columns. Each time fit() or fit_transform() is called, le forgets the previous data and learns only the current data. So the le you end up with only stores information about the last column it has seen, not all of the columns.
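A minimal sketch of what goes wrong (hypothetical example data; assuming pandas and scikit-learn):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'red'],
                   'size':  ['S', 'M', 'L']})

le = LabelEncoder()
df_enc = df.apply(le.fit_transform)  # each column is encoded in turn with the same le

# le was re-fit on every column, so it only remembers the last one ('size')
print(le.classes_)  # ['L' 'M' 'S'] -- the 'color' categories are gone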

There are multiple ways to solve this:

  1. You can maintain multiple LabelEncoder objects (one for each column); a sketch of this approach follows after the list below. See this excellent answer here:

  2. If you want to keep a single object to handle all columns, you can use the OrdinalEncoder if you have the latest version of scikit-learn installed.

     from sklearn.preprocessing import OrdinalEncoder

     enc = OrdinalEncoder()
     df = enc.fit_transform(df)
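Here is a minimal sketch of option 1 (one LabelEncoder per column, kept in a dict so each mapping can be inverted later); the example data is hypothetical and assumes pandas and scikit-learn:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'red'],
                   'size':  ['S', 'M', 'L']})

# One LabelEncoder per column, stored by column name
encoders = {}
for col in df.columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Decode a single column back to its original categories
original_size = encoders['size'].inverse_transform(df['size'])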

But even then the error will not be solved, because tree_.feature does not correspond to the values of the features, but to the index (the column in df) that was used for splitting at that node. So if you have 3 features (columns) in the data, irrespective of the values in those columns, tree_.feature can have the values:

  • 0, 1, 2, -2

  • -2 is a special placeholder value to denote that the node is a leaf node, and so no feature is used to split anything.

tree_.threshold will contain the threshold values corresponding to your data, but those will be floats, so you will have to map them back according to how the categories were converted to numbers.
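As a sketch, the split features can be inspected by column name instead of trying to inverse-transform them; this reuses df_x and the fitted dec_tree_classifier from the question (_tree.TREE_UNDEFINED is scikit-learn's constant for the -2 leaf marker):

from sklearn.tree import _tree

tree = dec_tree_classifier.tree_
feature_names = df_x.columns

for node_id in range(tree.node_count):
    if tree.feature[node_id] == _tree.TREE_UNDEFINED:  # -2: this node is a leaf
        print("node %d: leaf" % node_id)
    else:
        # tree.feature holds the column index used to split at this node
        col = feature_names[tree.feature[node_id]]
        print("node %d: split on '%s' at threshold %.3f"
              % (node_id, col, tree.threshold[node_id]))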

See this example for understanding the tree structure in detail:
