
Supervised Machine Learning: Getting weights for each individual parameter for classification

I have applied scikit-learn's decision tree algorithm over my data to get the outcome. Now, I want a mechanism to determine which factors contribute most to the prediction made by my algorithm, in a user-readable format.

Example: Suppose my training and test data are the same as the table below.

 <table border='1'> <thead> <th>Parameter1</th> <th>Parameter2</th> <th>Parameter3</th> <th>Parameter4</th> <th>Class</th> </thead> <tr> <td>abc</td> <td>1</td> <td>0.5</td> <td>2</td> <td>Success</td> </tr> <tr> <td>pqr</td> <td>1.2</td> <td>0.6</td> <td>1.4</td> <td>Success</td> </tr> <tr> <td>abc</td> <td>0.9</td> <td>1</td> <td>2</td> <td>Failure</td> </tr> </table> 

After applying the algorithm, I am able to predict things with good precision. Now, what I want is to provide users with the weights of all the parameters that contributed to the success/failure of the prediction.

Example:

  <table border='1'> <thead> <th>Parameter1</th> <th>Parameter2</th> <th>Parameter3</th> <th>Parameter4</th> <th>Class</th> </thead> <tr> <td style="background-color:#FEF3AD;">50%</td> <td style="background-color:#00FF00;">80%</td> <td style="background-color:#00FF00;">80%</td> <td style="background-color:#FEF3AD;">50%</td> <td>Success</td> </tr> <tr> <td style="background-color:#00BB00;">100%</td> <td style="background-color:#00D500;">90%</td> <td style="background-color:#c9ff00;">70%</td> <td style="background-color:#00D500;">90%</td> <td>Success</td> </tr> <tr> <td style="background-color:#FEF3AD;">50%</td> <td style="background-color:#ff7f39;">10%</td> <td style="background-color:#ff1a00;">5%</td> <td style="background-color:#FEF3AD;">50%</td> <td>Failure</td> </tr> </table> 

So, the second table indicates the extent to which each parameter contributes towards the success of that particular row.


What I have attempted till now is the following mechanism:

  1. I find the correlation coefficient for all the parameters using Kendall tau.
  2. For each parameter, I fire a GROUP BY query to get the success percentage:
  SELECT Parameter1, COUNT('SUCCESS')/COUNT(*) FROM table and joins WHERE clauses GROUP BY Parameter1; 
  3. I add the parameter's correlation coefficient to the success percentage obtained from the query. This step adjusts the plain statistical percentages with the correlation factors.

  4. I store the result for each parameter value in my database. Example:

    Parameter1, abc, 50%

    Parameter1, pqr, 100%

    And so on...
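Steps 1 and 2 above can be sketched in Python with `scipy.stats.kendalltau` and a pandas `groupby` standing in for the SQL query. The data frame and column names here are hypothetical, mirroring the example table:

```python
import pandas as pd
from scipy.stats import kendalltau

# Hypothetical data mirroring the example table above.
df = pd.DataFrame({
    "Parameter2": [1.0, 1.2, 0.9],
    "Parameter3": [0.5, 0.6, 1.0],
    "Class": ["Success", "Success", "Failure"],
})
target = (df["Class"] == "Success").astype(int)

# Step 1: Kendall tau correlation of each numeric parameter with the outcome.
for col in ["Parameter2", "Parameter3"]:
    tau, p_value = kendalltau(df[col], target)
    print(f"{col}: tau={tau:.2f}, p={p_value:.2f}")

# Step 2: success percentage per parameter value
# (the pandas equivalent of the GROUP BY query).
success_pct = df.groupby("Parameter2")["Class"].apply(
    lambda s: (s == "Success").mean() * 100
)
print(success_pct)
```

With more rows, each distinct parameter value would aggregate over all matching rows, just like the SQL GROUP BY.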


Is there a better or more efficient way of doing this? Please provide details.

Thank you.

You can use feature_importances_ to see the contribution of each feature. However, the values feature_importances_ returns do not directly reflect prediction accuracy.
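A minimal sketch of reading feature_importances_ from a fitted tree, using toy numeric data in place of the question's encoded parameters (the feature names are assumptions for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy numeric data standing in for the encoded parameters (hypothetical).
X = [[1.0, 0.5, 2.0],
     [1.2, 0.6, 1.4],
     [0.9, 1.0, 2.0]]
y = ["Success", "Success", "Failure"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# One importance value per column; the values sum to 1.
for name, imp in zip(["Parameter2", "Parameter3", "Parameter4"],
                     clf.feature_importances_):
    print(f"{name}: {imp:.2f}")
```

Note these are model-wide importances (impurity decrease across splits), not per-row contributions like the second table in the question.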

For that purpose, you can use mean decrease accuracy to evaluate each feature's contribution with respect to a specific evaluation metric. The following blog post contains a good explanation and Python sample code.

Selecting good features – Part III: random forests - Diving into data

The main idea of mean decrease accuracy is to choose one feature and randomly permute its values among all instances in the dataset, making the feature meaningless.

(A) If accuracy decreases, the selected feature is important for prediction.
(B) If not, the selected feature is not so important for prediction.

The merits of using mean decrease accuracy are:

(1) You can apply it to any classifier, including ensemble models.
(2) You can apply it to any evaluation metric.
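This idea is available directly in scikit-learn (0.22+) as `permutation_importance`; a sketch on synthetic data (the dataset and model choice here are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in data: 4 features, 2 of them informative.
X, y = make_classification(n_samples=200, n_features=4,
                           n_informative=2, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Mean decrease in accuracy when each column is shuffled,
# averaged over n_repeats random permutations.
result = permutation_importance(clf, X, y, scoring="accuracy",
                                n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: mean accuracy drop = {imp:.3f}")
```

Because it only needs predictions and a scorer, the same call works with any classifier and any `scoring` metric, which is exactly the merit described above.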
