
Visualize strengths and weaknesses of a sample from a pre-trained model

Let's say I'm trying to predict an apartment price. So, I have a lot of labeled data, where for each apartment I have features that could affect the price, like:

  • city
  • street
  • floor
  • year built
  • socioeconomic status
  • square feet
  • etc.

And I train a model, let's say XGBoost. Now I want to predict the price of a new apartment. Is there a good way to show what is "good" about this apartment and what is bad, and by how much (scaled 0-1)?

For example: the floor number is a "strong" feature (i.e., in this area this floor number is desirable, so it affects the apartment's price positively), but the socioeconomic status is a weak feature (i.e., the socioeconomic status is low, so it affects the price negatively).

What I want is to illustrate, more or less, why my model decided on this price, and to give the user a feel for the apartment's value from those indicators.

I thought of an exhaustive search over each feature, but I'm afraid that would take too much time.

Is there a smarter way of doing this?

Any help would be much appreciated...

Happy news for you: there is.

A package called SHAP (SHapley Additive exPlanations) was recently released just for that purpose. Here's a link to the GitHub repository.

It supports visualization of complicated models (which are hard to explain intuitively), like boosted trees (and XGBoost in particular!).

It can show you "real" feature importance, which is better than the "gain", "weight", and "cover" importances that xgboost supplies, as those are not consistent with each other.

You can read all about why SHAP is better for feature evaluation here.

It is hard to give you code that will work for you as-is, but the documentation is good and you should write code that suits your case.

Here are the guidelines for building your first graph:

import shap
import xgboost as xgb

# Assume X_train/y_train are the features and labels of your training samples,
# and feature_names is the list of column names
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)

# Train your xgboost model (the params and round counts are up to you)
params = {"objective": "reg:squarederror", "max_depth": 6}
bst = xgb.train(params, dtrain, num_boost_round=2500,
                evals=[(dtrain, "train")],  # use a held-out validation set here in practice
                early_stopping_rounds=200)

# Build the SHAP "explainer" object for the trained trees
explainer = shap.TreeExplainer(bst)

# Values you explain -- I used my test set here, but you can "explain" whatever you want
shap_values = explainer.shap_values(X_test)

# Global view: per-feature impact distribution, and mean |SHAP| as a bar chart
shap.summary_plot(shap_values, X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

To plot why a certain sample got its score, you can use the built-in SHAP function for it (it only renders in a Jupyter notebook). There's a perfect example here.

I personally wrote a function that plots it using matplotlib, which takes some effort.

Here is an example of a plot I've made using the SHAP values (the features are confidential, so they are all erased):

You can see a 97% prediction for label=1, and how much each feature added to or subtracted from the model's log-odds output for that specific sample.
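If you also want the 0-1 scale the question asked for, one simple convention (my own, not part of SHAP) is to min-max scale a sample's SHAP values symmetrically around zero, so that 0.5 means "no effect on the price":

```python
import numpy as np

# Hypothetical SHAP values for one apartment's features
sample_shap = np.array([0.30, 0.12, -0.05, -0.22])

# Scale symmetrically around zero: 0 = strongest negative effect,
# 0.5 = no effect, 1 = strongest positive effect
span = np.abs(sample_shap).max()
scaled = (sample_shap + span) / (2 * span)
print(scaled)  # approximately [1.0, 0.7, 0.417, 0.133]
```

Scaling per-sample like this keeps the indicators comparable within one apartment; scale by a global maximum instead if you want them comparable across apartments.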
