I have applied a scikit-learn decision tree algorithm to my data to get the outcome. Now I want a mechanism to determine which factors contribute most to the predictions made by my algorithm, in a user-readable format.
Example: Suppose my training and test data look like the table below.
<table border='1'> <thead> <th>Parameter1</th> <th>Parameter2</th> <th>Parameter3</th> <th>Parameter4</th> <th>Class</th> </thead> <tr> <td>abc</td> <td>1</td> <td>0.5</td> <td>2</td> <td>Success</td> </tr> <tr> <td>pqr</td> <td>1.2</td> <td>0.6</td> <td>1.4</td> <td>Success</td> </tr> <tr> <td>abc</td> <td>0.9</td> <td>1</td> <td>2</td> <td>Failure</td> </tr> </table>
After applying the algorithm, I am able to predict with good precision. Now I want to provide users with the weights of all the parameters that contributed to the success or failure of each prediction.
Example:
<table border='1'> <thead> <th>Parameter1</th> <th>Parameter2</th> <th>Parameter3</th> <th>Parameter4</th> <th>Class</th> </thead> <tr> <td style="background-color:#FEF3AD;">50%</td> <td style="background-color:#00FF00;">80%</td> <td style="background-color:#00FF00;">80%</td> <td style="background-color:#FEF3AD;">50%</td> <td>Success</td> </tr> <tr> <td style="background-color:#00BB00;">100%</td> <td style="background-color:#00D500;">90%</td> <td style="background-color:#c9ff00;">70%</td> <td style="background-color:#00D500;">90%</td> <td>Success</td> </tr> <tr> <td style="background-color:#FEF3AD;">50%</td> <td style="background-color:#ff7f39;">10%</td> <td style="background-color:#ff1a00;">5%</td> <td style="background-color:#FEF3AD;">50%</td> <td>Failure</td> </tr> </table>
So, the second table indicates to what extent the associated parameters are contributing towards the success of that particular row.
What I have attempted so far is the following mechanism:
SELECT Parameter1, COUNT('SUCCESS')/COUNT(*) FROM table and joins WHERE clauses GROUP BY Parameter1;
Add each parameter's correlation coefficient to the Success% obtained from the query, so that the plain statistical percentages also reflect the correlation.
Store the result for each parameter value in my database. Example:
Parameter1, abc, 50%
Parameter1, pqr, 100%
And so on...
Is there a better or more efficient way of doing this? Please provide the details.
Thank you.
You can use feature_importances_ to see the contribution of each feature. However, the values that feature_importances_ holds do not directly consider prediction accuracy.
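As a minimal sketch of reading feature_importances_ from a fitted tree: the iris dataset below is only a stand-in for the asker's data (an assumption), and any categorical column such as Parameter1 would first need to be encoded numerically before a scikit-learn tree can use it.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for the real training data.
data = load_iris()
X, y = data.data, data.target

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# feature_importances_ is impurity-based and normalized to sum to 1,
# so each value can be shown to users as a percentage.
importances = dict(zip(data.feature_names, clf.feature_importances_))
for name, weight in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {weight:.1%}")
```

Note that these weights describe the model as a whole, not an individual row.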
For that purpose, you can use mean decrease accuracy to evaluate each feature's contribution with respect to a specific evaluation metric. The following blog post contains a good explanation and Python sample code:
Selecting good features – Part III: random forests - Diving into data
The main idea of mean decrease accuracy is to choose one feature and randomly permute its values across all instances in the dataset, which makes that feature meaningless.
(A) If accuracy decreases, the selected feature is important for prediction.
(B) If not, the selected feature is not so important for prediction.
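The idea can be sketched by hand as follows. This is an illustrative implementation, not the code from the blog post: the helper name mean_decrease_accuracy, the number of repeats, and the use of the iris dataset with a held-out split are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def mean_decrease_accuracy(model, X, y, metric=accuracy_score, n_repeats=10, seed=0):
    """Hypothetical helper: average drop in `metric` when each column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle only column j, breaking its link to the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j] += baseline - metric(y, model.predict(X_perm))
    return drops / n_repeats


# Assumption: iris stands in for the real data; scores are measured on held-out rows.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

drops = mean_decrease_accuracy(model, X_test, y_test)
print(drops)  # larger drop = more important feature
```

A feature the tree never splits on gets a drop of exactly zero, because shuffling it cannot change any prediction.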
The merits of using mean decrease accuracy are:
(1) You can apply it to any classifier, including ensemble models.
(2) You can apply it to any evaluation metric.
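Both points can be illustrated with scikit-learn's built-in permutation_importance function (available in sklearn.inspection since version 0.22). In this sketch the random forest, the "f1_macro" scoring string, and the iris dataset are assumptions picked for illustration; any fitted estimator and any supported scorer could be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Assumption: iris stands in for the real data.
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)

# An ensemble model instead of a single tree.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# A metric other than plain accuracy, selected through `scoring`.
result = permutation_importance(
    forest, X_test, y_test, scoring="f1_macro", n_repeats=10, random_state=0
)

for name, mean, std in zip(data.feature_names, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```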