
Supervised Machine Learning: Getting weights for each individual parameter for classifications

I have applied the scikit-learn decision tree algorithm to my data to get predictions. Now, I want a mechanism to determine which factors contribute most to the prediction made by my algorithm, in a user-readable format.

Example: Suppose my training and test data are as shown in the table below.

| Parameter1 | Parameter2 | Parameter3 | Parameter4 | Class   |
|------------|------------|------------|------------|---------|
| abc        | 1          | 0.5        | 2          | Success |
| pqr        | 1.2        | 0.6        | 1.4        | Success |
| abc        | 0.9        | 1          | 2          | Failure |

After applying the algorithm, I am able to predict things with good precision. Now, what I want is to provide users with the weights of all the parameters that have contributed to the success/failure prediction.

Example:

| Parameter1 | Parameter2 | Parameter3 | Parameter4 | Class   |
|------------|------------|------------|------------|---------|
| 50%        | 80%        | 80%        | 50%        | Success |
| 100%       | 90%        | 70%        | 90%        | Success |
| 50%        | 10%        | 5%         | 50%        | Failure |

So, the second table indicates to what extent the associated parameters are contributing towards the success of that particular row.


What I have attempted so far is to put the following mechanism in place:

  1. I am finding the correlation coefficient using Kendall's tau (kendalltau) for all the parameters (a sketch of this step is shown after the list).
  2. For each parameter, firing GROUP BY queries to get the success percentage:
  SELECT Parameter1, COUNT('SUCCESS')/COUNT(*) FROM table and joins WHERE clauses GROUP BY Parameter1;
  3. Adding the parameter's correlation coefficient to the success percentage obtained from the queries. This step adds the correlation factor to the plain statistical percentages.
  4. Storing each parameter in my database. Example:

     Parameter1, abc, 50%

     Parameter1, pqr, 100%

     And so on...
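For reference, a minimal sketch of the Kendall's tau step, assuming the data sits in a pandas DataFrame shaped like the first table (the DataFrame and the integer encoding of the categorical columns below are illustrative assumptions, not code from the original post):

```python
# Sketch of step 1: Kendall's tau between each parameter and the class label.
# Categorical columns are integer-encoded first so kendalltau gets numeric input.
import pandas as pd
from scipy.stats import kendalltau

df = pd.DataFrame({
    "Parameter1": ["abc", "pqr", "abc"],
    "Parameter2": [1, 1.2, 0.9],
    "Parameter3": [0.5, 0.6, 1],
    "Parameter4": [2, 1.4, 2],
    "Class":      ["Success", "Success", "Failure"],
})

encoded = df.copy()
for col in ["Parameter1", "Class"]:
    encoded[col] = encoded[col].astype("category").cat.codes

correlations = {}
for col in ["Parameter1", "Parameter2", "Parameter3", "Parameter4"]:
    tau, p_value = kendalltau(encoded[col], encoded["Class"])
    correlations[col] = tau

print(correlations)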


Is there a better or more efficient way of doing this? Please provide the details.

Thank you.

You can use feature_importances_ to see the contribution of each feature. However, the values feature_importances_ returns do not directly take prediction accuracy into account.
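For illustration, a minimal sketch with scikit-learn (the DataFrame simply mirrors the first table above, and the one-hot encoding of Parameter1 is an assumption for handling the categorical column):

```python
# Sketch: train a decision tree and read feature_importances_ for each feature.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "Parameter1": ["abc", "pqr", "abc"],
    "Parameter2": [1, 1.2, 0.9],
    "Parameter3": [0.5, 0.6, 1],
    "Parameter4": [2, 1.4, 2],
    "Class":      ["Success", "Success", "Failure"],
})

X = pd.get_dummies(df.drop(columns="Class"))   # one-hot encode Parameter1
y = df["Class"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# One importance value per (encoded) feature; the values sum to 1.
for name, importance in zip(X.columns, clf.feature_importances_):
    print(f"{name}: {importance:.2f}")
```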

For that purpose, you can use mean decrease accuracy to evaluate each feature's contribution with respect to a specific evaluation metric. The following blog post contains a good explanation and Python sample code.

Selecting good features – Part III: random forests - Diving into data

The main idea of mean decrease accuracy is to choose one feature and randomly permute its values among all instances in the dataset, making that feature meaningless.

(A) If accuracy decreases, the selected feature is important for prediction.
(B) If not, the selected feature is not so important for prediction.

The merits of using mean decrease accuracy are:

(1) You can apply it to any classifiers including ensemble models.
(2) You can apply it to any evaluation metric.
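As a rough sketch of that permutation idea (not the blog's exact code; clf, X_test, and y_test stand in for your own fitted classifier and held-out data, with X_test assumed to be a numeric NumPy array):

```python
# Rough sketch of mean decrease accuracy: shuffle one feature at a time and
# record how much the chosen metric (here, accuracy) drops on held-out data.
import numpy as np
from sklearn.metrics import accuracy_score

def mean_decrease_accuracy(clf, X_test, y_test, n_repeats=10, random_state=0):
    """Return {column index: average accuracy drop} for a fitted classifier."""
    rng = np.random.RandomState(random_state)
    baseline = accuracy_score(y_test, clf.predict(X_test))
    drops = {}
    for j in range(X_test.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            rng.shuffle(X_perm[:, j])          # permute column j only
            scores.append(accuracy_score(y_test, clf.predict(X_perm)))
        drops[j] = baseline - np.mean(scores)  # larger drop => more important
    return drops
```

Because the loop only needs predict() and a scoring function, it works with any classifier and any metric, which corresponds to the two merits above.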
