
Visualize strengths and weaknesses of a sample from pre-trained model

Let's say I'm trying to predict an apartment price. So, I have a lot of labeled data, where for each apartment I have features that could affect the price, like:

  • city
  • street
  • floor
  • year built
  • socioeconomic status
  • square feet
  • etc.

And I train a model, let's say XGBOOST. Now, I want to predict the price of a new apartment. Is there a good way to show what is "good" in this apartment and what is bad, and by how much (scaled 0-1)?

For example: the floor number is a "strong" feature (i.e. in this area this floor number is desirable, so it affects the price of the apartment positively), but the socioeconomic status is a weak feature (i.e. the socioeconomic status is low, so it affects the price of the apartment negatively).

What I want is to illustrate, more or less, why my model decided on this price, and I want the user to get a feel for the apartment's value from those indicators.

I thought of an exhaustive search over each feature, but I'm afraid that would take too much time.

Is there a more brilliant way of doing this?

Any help would be much appreciated...

Happy news for you: there is.

A package called "SHAP" (SHapley Additive exPlanation) was recently released just for that purpose. Here's a link to the github.

It supports visualization of complicated models (which are hard to explain intuitively), like boosted trees (and XGBOOST in particular!).

It can show you "real" feature importance, which is better than the "gain", "weight", and "cover" importances that xgboost supplies, as those are not consistent.

You can read all about why SHAP is better for feature evaluation here.
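The link above covers the theory; as a minimal illustration of what SHAP approximates, here is a brute-force Shapley value calculation over a toy pricing rule. Everything in it is hypothetical for the sketch (the feature names, baseline values, and the `predict` formula are made up, not from a real trained model) — a coalition's value is the prediction with that coalition's features set to the sample's values and the rest held at a baseline:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values for a value function over feature coalitions."""
    n = len(features)
    phi = {}
    for i in features:
        rest = [f for f in features if f != i]
        total = 0.0
        for r in range(n):
            for S in combinations(rest, r):
                # classic Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (value(set(S) | {i}) - value(set(S)))
        phi[i] = total
    return phi

# Hypothetical apartments: a "baseline" (average) one and the one to explain
BASELINE = {"floor": 2, "sqft": 60, "ses": 5}
SAMPLE   = {"floor": 8, "sqft": 75, "ses": 2}

def predict(x):
    # made-up pricing rule, standing in for a trained model
    return 1000 * x["sqft"] + 3000 * x["floor"] + 8000 * x["ses"]

def value(coalition):
    # features in the coalition take the sample's value, the rest the baseline
    x = {f: (SAMPLE[f] if f in coalition else BASELINE[f]) for f in BASELINE}
    return predict(x)

phi = shapley_values(list(BASELINE), value)
# Efficiency property: contributions sum to prediction(sample) - prediction(baseline)
assert abs(sum(phi.values()) - (value(set(BASELINE)) - value(set()))) < 1e-6
print(phi)  # floor pushes the price up, ses pulls it down
```

This exhaustive version is exponential in the number of features; the point of SHAP's `TreeExplainer` is that it computes the same quantities efficiently for tree ensembles.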

It will be hard to give you code that will work for you as-is, but there is good documentation and you should write a version that suits you.

Here are the guidelines for building your first graph:

import shap
import xgboost as xgb

# Assume X_train/y_train (and X_test) hold the features and labels of your data;
# feature_names and weights_trn are your own column names and sample weights
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names, weight=weights_trn)

# Train your xgboost model (params0 and watchlist are your own settings)
bst = xgb.train(params0, dtrain, num_boost_round=2500, evals=watchlist, early_stopping_rounds=200)

# SHAP "explainer" object
explainer = shap.TreeExplainer(bst)

# The values you explain; I took mine from my test set, but you can "explain" whatever you want here
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

To plot "why a certain sample got its score", you can use the built-in SHAP function for it (it only works in a Jupyter Notebook). Perfect example here.

I personally wrote a function that plots it using matplotlib, which takes some effort.
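My function is tied to my own data, but as a rough sketch of the idea (plain text instead of matplotlib, and made-up contribution numbers rather than real SHAP output), you can sort one sample's per-feature SHAP values by magnitude and scale them to the 0-1 range the question asked for:

```python
# Hypothetical per-feature SHAP values for one apartment (made-up numbers)
contrib = {"floor": 0.42, "sqft": 0.18, "city": 0.05, "ses": -0.31, "year": -0.07}

def text_waterfall(contrib):
    """Return lines of features sorted by |impact|, with magnitudes scaled to 0-1."""
    biggest = max(abs(v) for v in contrib.values())
    lines = []
    for name, v in sorted(contrib.items(), key=lambda kv: -abs(kv[1])):
        scaled = abs(v) / biggest             # the 0-1 scale from the question
        sign = "+" if v >= 0 else "-"         # pushes the prediction up or down
        bar = "#" * int(round(scaled * 20))
        lines.append(f"{name:>5} {sign} {scaled:4.2f} {bar}")
    return lines

print("\n".join(text_waterfall(contrib)))
```

The same sorting-and-scaling step is what a matplotlib horizontal bar chart of the sample's SHAP values would do, just rendered graphically.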

Here is an example of a plot I've made using the SHAP values (the features are confidential, so they are all erased):

You can see a 97% prediction for label=1, and, for that specific sample, each feature and how much it added to or subtracted from the log-loss.

