Python: How to handle outliers in a regression QQ plot?

Question

I drew a QQ plot for a multiple regression and got the graph below. Can someone tell me why there are two points under the red line? And do these points have an effect on my model?

I used the code below to draw the graph.

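import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# fit the regression on the training split
reg = LinearRegression()
reg = reg.fit(x_train, y_train)

# out-of-sample residuals: observed minus predicted
pred_reg_GS = reg.predict(x_test)
diff = y_test - pred_reg_GS

# QQ plot of the residuals against a 45-degree reference line
sm.qqplot(diff, fit=True, line='45')
plt.show()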

Answer 1

Take a look at Understanding QQ Plots for a concise description of what a QQ plot is. In your case, this particular part is important:

"If both sets of quantiles came from the same distribution, we should see the points forming a line that's roughly straight."

Your plot shows this theoretical one-to-one relationship explicitly as the red line.
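To make that comparison concrete, here is a minimal sketch of the pairs a QQ plot draws. It uses synthetic numbers as a stand-in for your residuals (I don't have your data, so the values and the seed are assumptions) and scipy.stats.probplot to get the theoretical quantiles next to the sorted sample values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for residuals (an assumption, not your data):
# 50 standard-normal draws plus two large negative values.
resid = np.concatenate([rng.normal(0.0, 1.0, 50), [-6.0, -7.0]])

# probplot returns the theoretical normal quantiles and the sorted sample
# values; these pairs are the points a QQ plot draws.
(theoretical_q, ordered_vals), _ = stats.probplot(resid, dist="norm")

# The two smallest sample values sit far below their theoretical quantiles.
for t, s in zip(theoretical_q[:2], ordered_vals[:2]):
    print(f"theoretical {t:6.2f}   sample {s:6.2f}")
```

Points like these two are the ones that end up well below the reference line, which is roughly what your plot seems to show.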

And regarding your question...

"do these points have an effect on my model?"

... one or both of the points that lie far from the red line could be considered outliers. This means that whatever model you've tried to build here does not capture the properties of those two observations. If what we're looking at is a QQ plot of the residuals from a regression model, you should take a closer look at those two observations. What is it about these two that makes them stand out from the rest of your sample? One common way to "catch" such outliers is to represent them with one or two dummy variables.

Edit 1: Basic approach for outliers and dummy variables

Since you haven't explicitly tagged your question sklearn, I'm taking the liberty of illustrating this with statsmodels. And in lieu of a sample of your data, I'll just use the iris dataset that ships with seaborn, restricted to the setosa rows.

1. Linear regression of sepal_length on sepal_width

Plot 1:

Looks good! Nothing wrong here. But let's mix it up a bit by adding some extreme values to the dataset. You'll find a complete code snippet at the end.

2. Introduce an outlier

Now, let's add a row to the dataframe where sepal_width = 8 instead of 3. This gives you a qqplot with one very clear outlier.

And here's a part of the model summary:

===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
sepal_width     1.8690      0.033     57.246      0.000       1.804       1.934
==============================================================================
Omnibus:                       18.144   Durbin-Watson:                   0.427
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                7.909
Skew:                          -0.338   Prob(JB):                       0.0192
Kurtosis:                       2.101   Cond. No.                         1.00
==============================================================================

So why is this an outlier? Because we messed with the dataset. I can't determine the reason for the outliers in your dataset. In our made-up example, there could be many reasons for a setosa iris to have a sepal width of 8. Maybe the scientist labeled it wrong? Maybe it isn't a setosa at all? Or maybe it has been genetically modified?

Now, instead of just discarding this observation from the sample, it's usually more informative to keep it where it is, accept that there is something special about it, and make exactly that explicit by including a dummy variable that is 1 for that observation and 0 for all others. The last row of your dataframe then has outlier_dummy = 1, and every other row has 0.

3. Identify the outlier using a dummy variable

Now the flagged observation no longer stands out in your qqplot.

And here's your model summary:

=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
sepal_width       1.4512      0.015     94.613      0.000       1.420       1.482
outlier_dummy    -6.6097      0.394    -16.791      0.000      -7.401      -5.819
==============================================================================
Omnibus:                        1.917   Durbin-Watson:                   2.188
Prob(Omnibus):                  0.383   Jarque-Bera (JB):                1.066
Skew:                           0.218   Prob(JB):                        0.587
Kurtosis:                       3.558   Cond. No.                         27.0
==============================================================================

Notice that including the dummy variable changes the coefficient estimate for sepal_width, and also the values for Skew and Kurtosis. And that's the short version of the effects an outlier will have on your model.

Complete code:

import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from matplotlib import pyplot as plt

# sample data
df = sns.load_dataset('iris')

# subset of sample data (copy so the new column is not set on a view)
df = df[df['species'] == 'setosa'].copy()

# add column for dummy variable
df['outlier_dummy'] = 0

# append a row with an extreme value for sepal width (8)
# as well as a dummy variable = 1 for that row.
# column order: sepal_length, sepal_width, petal_length, petal_width, species, outlier_dummy
df.loc[len(df)] = [5, 8, 1.4, 0.3, 'setosa', 1]

# define independent variables (no constant is added, so the fit has no intercept)
x = ['sepal_width', 'outlier_dummy']

# run regression
mod_fit = sm.OLS(df['sepal_length'], df[x]).fit()
res = mod_fit.resid

# QQ plot of the residuals, then the model summary
fig = sm.qqplot(res)
plt.show()
print(mod_fit.summary())

Answer 1 (score 2, accepted, 2020-01-03 14:38:49)
