How to interpret trees from random forest via python

I'm trying to figure out how to interpret the trees in my random forest. My data contains around 29,000 observations and 35 features. I pasted the first 22 observations, the first 11 features, as well as the feature that I am trying to predict (HighLowMobility).

birthcohort countyfipscode  county_name cty_pop2000 statename   state_id    stateabbrv  perm_res_p25_kr24   perm_res_p75_kr24   perm_res_p25_c1823  perm_res_p75_c1823  HighLowMobility
1980    1001    Autauga 43671   Alabama 1   AL  45.2994 60.7061         Low
1981    1001    Autauga 43671   Alabama 1   AL  42.6184 63.2107 29.7232 75.266  Low
1982    1001    Autauga 43671   Alabama 1   AL  48.2699 62.3438 38.0642 72.2544 Low
1983    1001    Autauga 43671   Alabama 1   AL  42.6337 56.4204 38.2588 80.4664 Low
1984    1001    Autauga 43671   Alabama 1   AL  44.0163 62.2799 38.1238 73.747  Low
1985    1001    Autauga 43671   Alabama 1   AL  45.7178 61.3187 40.9339 83.0661 Low
1986    1001    Autauga 43671   Alabama 1   AL  47.9204 59.6553 47.4841 72.491  Low
1987    1001    Autauga 43671   Alabama 1   AL  48.3108 54.042  53.199  84.5379 Low
1988    1001    Autauga 43671   Alabama 1   AL  47.9855 59.42   52.8927 85.2844 Low
1980    1003    Baldwin 140415  Alabama 1   AL  42.4611 51.4142         Low
1981    1003    Baldwin 140415  Alabama 1   AL  43.0029 55.1014 35.5923 76.9857 Low
1982    1003    Baldwin 140415  Alabama 1   AL  46.2496 56.0045 38.679  77.038  Low
1983    1003    Baldwin 140415  Alabama 1   AL  44.3001 54.5173 38.7106 81.0388 Low
1984    1003    Baldwin 140415  Alabama 1   AL  46.4349 55.5245 42.4422 80.3047 Low
1985    1003    Baldwin 140415  Alabama 1   AL  47.1544 52.8189 42.7994 79.0835 Low
1986    1003    Baldwin 140415  Alabama 1   AL  47.553  54.934  42.0653 78.4398 Low
1987    1003    Baldwin 140415  Alabama 1   AL  48.9752 54.3541 39.96   79.4915 Low
1988    1003    Baldwin 140415  Alabama 1   AL  48.6887 55.3087 43.8557 79.387  Low
1980    1005    Barbour 29038   Alabama 1   AL                  Low
1981    1005    Barbour 29038   Alabama 1   AL  37.5338 54.3618 34.8771 75.1904 Low
1982    1005    Barbour 29038   Alabama 1   AL  37.028  57.2471 36.5392 90.3262 Low
1983    1005    Barbour 29038   Alabama 1   AL                  Low

Here is my random forest:

    import pandas as pd
    import numpy as np

    #Load the data into a data frame
    X = pd.read_csv('raw_data_for_edits.csv')

    #Impute the missing values with median values
    X = X.fillna(X.median())

    #Drop the categorical columns
    X = X.drop(['county_name', 'statename', 'stateabbrv'], axis=1)

    #Collect the output in the y variable, then drop it from the features
    y = X['HighLowMobility']
    X = X.drop(['HighLowMobility'], axis=1)

    from sklearn.preprocessing import LabelEncoder

    #Encode the output labels: 'Low' -> 0, anything else -> 1
    def preprocess_labels(y):
        yp = []
        for i in range(len(y)):
            if str(y[i]) == 'Low':
                yp.append(0)
            elif str(y[i]) == 'High':
                yp.append(1)
            else:
                yp.append(1)
        return yp



    #y = LabelEncoder().fit_transform(y)
    yp = preprocess_labels(y)
    yp = np.array(yp)
    yp.shape
    X.shape

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, yp, test_size=0.25, random_state=42)
    X_train = np.array(X_train)
    y_train = np.array(y_train)
    X_test = np.array(X_test)
    y_test = np.array(y_test)
    training_data = X_train, y_train
    test_data = X_test, y_test
    dims = X_train.shape[1]

    if __name__ == '__main__':
        #Neural_Network is a custom class defined elsewhere in my project
        nn = Neural_Network([dims, 10, 5, 1], learning_rate=1, C=1, opt=False, check_gradients=True, batch_size=200, epochs=100)
        nn.fit(X_train, y_train)
        weights = nn.final_weights()
        testlabels_out = nn.predict(X_test)
        print(testlabels_out)
        print("Neural Net Accuracy is " + str(np.round(nn.score(X_test, y_test), 2)))


    '''
    RANDOM FOREST AND LOGISTIC REGRESSION
    '''
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    clf1 = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None)
    clf2 = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=0)

    for clf, label in zip([clf1, clf2], ['Logistic Regression', 'Random Forest']):
        scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

How would I interpret my trees? For example, perm_res_p25_c1823 is a feature that gives the college attendance at ages 18-23 for a child born at the 25th percentile, perm_res_p75_c1823 represents the 75th percentile, and the HighLowMobility feature states whether there is high or low upward income mobility. So how would I show the following: "If the person comes from the 25th percentile and lives in Autauga, Alabama, then they will probably have lower upward mobility"?

You cannot really interpret an RF in such terms, because a random forest does not work this way. It creates a highly randomized ensemble of trees, which can have very different decision rules. Once you go from decision trees, which are fully interpretable, to an RF, you lose this aspect of the classifier. RFs are black boxes. You can make many different approximations and estimations, but they will effectively ignore or alter your RF.
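To see what this means in practice, here is a minimal sketch (assuming the clf2 forest from the question has been fitted on X_train, y_train) that prints the decision rules of a single constituent tree with sklearn.tree.export_text. Each tree on its own is perfectly readable; the ensemble prediction, however, is a vote over 100 such trees with different rules, which is why no single rule set describes the forest.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import export_text

    #Assumed: X_train, y_train and the feature frame X exist as in the question
    clf2 = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=0)
    clf2.fit(X_train, y_train)

    #estimators_ holds the individual DecisionTreeClassifier objects;
    #print the (fully interpretable) rules of the first one
    first_tree = clf2.estimators_[0]
    print(export_text(first_tree, feature_names=list(X.columns)))

Repeating this for clf2.estimators_[1], [2], ... shows how much the trees disagree, which is exactly the interpretability the ensemble gives up.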

Explainability is a hot research area, and recently newer tools have been developed to explain tree ensemble models using a handful of human-understandable rules. Here are a few options for explaining tree ensemble models that you can try:

You can use TE2Rules (Tree Ensembles to Rules) to extract human-understandable rules that explain a scikit-learn tree ensemble (like GradientBoostingClassifier). It provides levers to control interpretability, fidelity and run-time budget to extract useful explanations. Rules extracted by TE2Rules are guaranteed to closely approximate the tree ensemble, because it considers the joint interactions of multiple trees in the ensemble.
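A minimal sketch of how this could look on the question's forest, based on the usage shown in the TE2Rules documentation (argument names and supported model types may differ between versions, so treat this as an illustration rather than a definitive recipe):

    from te2rules.explainer import ModelExplainer

    #Assumed: clf2 is the fitted RandomForestClassifier and X, X_train are as in the question
    explainer = ModelExplainer(model=clf2, feature_names=list(X.columns))

    #Extract rules that approximate the forest's own predictions on the training data
    rules = explainer.explain(X=X_train, y=clf2.predict(X_train))
    for rule in rules:
        print(rule)

Each extracted rule is a readable conjunction of feature thresholds, which is close in spirit to the "if 25th percentile and Autauga, then Low" statement the question asks for.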

An alternative is SkopeRules, which is part of scikit-learn-contrib. SkopeRules extracts rules from the individual trees in the ensemble and keeps the rules with high precision/recall across the whole ensemble. This is often quick, but may not represent the ensemble well enough.
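A minimal sketch along the lines of the skope-rules README (the parameter names and the exact format of rules_ are assumptions based on that README and may vary by version):

    from skrules import SkopeRules

    #Assumed: X, X_train, y_train exist as in the question; SkopeRules learns rules
    #for the positive class, i.e. label 1 (anything not 'Low' under the question's encoding)
    skope = SkopeRules(feature_names=list(X.columns), precision_min=0.5, recall_min=0.1, n_estimators=30)
    skope.fit(X_train, y_train)

    #rules_ is a list of (rule_string, (precision, recall, n_occurrences)) tuples
    for rule, perf in skope.rules_:
        print(rule, perf)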

For developers who work in R, the InTrees package is a good option.

References:

TE2Rules: code: https://github.com/groshanlal/TE2Rules, documentation: https://te2rules.readthedocs.io/en/latest/

SkopeRules: code: https://github.com/scikit-learn-contrib/skope-rules

InTrees: https://cran.r-project.org/web/packages/inTrees/index.html

Disclosure: I'm one of the core developers of TE2Rules.
