
Why is my Normal Q-Q Plot of residuals a vertical line?

I am using a QQ plot to test whether the residuals of my linear regression follow a normal distribution, but the result is a vertical line.

It looks like linear regression is a pretty good model for this dataset, so shouldn't the residuals be normally distributed?

[Figure: regression line and unstandardized points used for prediction]

[Figure: residual plot]

[Figure: QQ plot]

The points were created randomly:

import numpy as np

x_values = np.linspace(0, 5, 100)[:, np.newaxis]
y_values = 29 * x_values + 30 * np.random.rand(100, 1)

Then, I fitted a linear regression model:

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(x_values, y_values)
predictions = reg.predict(x_values)
residuals = y_values - predictions

Finally, I used the statsmodels module to plot the QQ plot of the residuals:

import statsmodels.api as sm

fig = sm.qqplot(residuals, line='45')

The whole idea of a QQ plot is to compare the quantiles of a true normal distribution against those of your residuals.

Hence, if the quantiles of the theoretical distribution (which is in fact normal) match those of your residuals (i.e., they look like a straight line when plotted against each other), then you can conclude that the model from which you derived those residuals is a good fit.
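That quantile-vs-quantile comparison can be sketched by hand; this minimal example (not from the original answer; the variable names are illustrative) sorts a normal sample and plots nothing, just checks how well the empirical quantiles line up with the theoretical ones:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
residuals = rng.normal(size=100)

# Sorted sample values act as the empirical quantiles
sample_q = np.sort(residuals)

# Theoretical standard-normal quantiles at matching probability points
probs = (np.arange(1, 101) - 0.5) / 100
theory_q = stats.norm.ppf(probs)

# For genuinely normal residuals the two sets are almost perfectly
# linearly related, which is what a straight QQ line expresses
r = np.corrcoef(theory_q, sample_q)[0, 1]
print(r)
```

A correlation very close to 1 here corresponds to the points hugging the reference line in `sm.qqplot`.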

shouldn't the residuals be normally distributed?

If you look at your code, you'll see that the random terms in y_values are NOT normally distributed (you generated them with numpy.random.rand, which draws from a uniform distribution, not a normal one).
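A quick sketch of the difference (assuming NumPy only; not part of the original answer): `np.random.rand` stays inside [0, 1), while `np.random.randn` gives standard-normal draws, which is what you'd want for normally distributed errors:

```python
import numpy as np

np.random.seed(0)

# np.random.rand draws from Uniform[0, 1): every value lies in that interval
uniform_draws = np.random.rand(10_000)
print(uniform_draws.min(), uniform_draws.max())

# For normally distributed errors, use np.random.randn (standard normal)
normal_draws = np.random.randn(10_000)
print(normal_draws.mean())  # close to 0
```

Swapping `30 * np.random.rand(100, 1)` for something like `15 * np.random.randn(100, 1)` would make the generated errors actually normal.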

Regardless, if your QQ plot looks like a straight line, that means your residuals match those of a normal distribution.

I think you were expecting to see the distribution of your errors. That is, you're looking for a histogram, not a QQ plot!

# Import pyplot
from matplotlib import pyplot as plt

# Plot histogram of residuals
plt.hist(residuals, bins=10)
plt.show()

[Figure: histogram of residuals]

In my case, it does not look like the residuals follow a normal distribution, but this will differ on your end because you did not set a random seed when generating y_values.
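If you want a numeric check to go with the histogram, a Shapiro-Wilk test is one option (a sketch using `scipy.stats.shapiro`, not part of the original answer; the residual arrays here are synthetic stand-ins):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
uniform_resid = rng.uniform(-1, 1, size=200)  # like errors from rand()
normal_resid = rng.normal(0, 1, size=200)     # genuinely normal errors

# Shapiro-Wilk: a small p-value is evidence against normality
_, p_uniform = stats.shapiro(uniform_resid)
_, p_normal = stats.shapiro(normal_resid)
print(p_uniform, p_normal)
```

With a sample this size, the uniform residuals are rejected decisively, matching what the flat-topped histogram suggests visually.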

Your problem here is two-fold:

  1. The primary problem is that sklearn (scikit-learn) expects your input to be in a 2d columnar array, whereas qqplot from statsmodels expects your data to be in a true 1d array. When you pass the residuals to qqplot, it attempts to transform each residual individually instead of the dataset as a whole.

  2. numpy.random.rand draws from a uniform distribution, so your errors aren't normal to begin with!
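The shape mismatch in point 1 is easy to see in isolation; a minimal sketch (array names are illustrative):

```python
import numpy as np

# sklearn-style output: a 2d columnar array of shape (n, 1)
residuals_2d = np.zeros((100, 1))
print(residuals_2d.shape)  # (100, 1)

# statsmodels' qqplot wants a true 1d array of shape (n,);
# either column indexing or ravel() produces one
flat_a = residuals_2d[:, 0]
flat_b = residuals_2d.ravel()
print(flat_a.shape, flat_b.shape)  # (100,) (100,)
```

Handed the `(100, 1)` array, qqplot effectively sees 100 one-point samples, which is why the points collapse into a vertical line.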

To highlight this, I've adapted your code sample. The top row in the resulting figure comprises predictions & residuals for a uniform residual distribution, whereas the bottom row uses normally distributed errors.

The difference between the "qq_bad" and "qq_good" plots simply comes down to selecting the column of data and passing it in as a true 1d array (instead of a 2d columnar array).

from matplotlib.pyplot import subplot_mosaic, show, rc
from matplotlib.lines import Line2D
from matplotlib.transforms import blended_transform_factory
from numpy.random import default_rng
from numpy import linspace
from sklearn.linear_model import LinearRegression
from statsmodels.api import qqplot
from scipy.stats import zscore

rc('font', size=14)
rc('axes.spines', top=False, right=False)

rng = default_rng(0)
size = 100

x_values = linspace(0, 5, size)[:, None]
errors = {
    'uniform': rng.uniform(low=-50, high=50, size=(size, 1)),
    'normal':  rng.normal(loc=0, scale=15, size=(size, 1))
}

fig, axd = subplot_mosaic([
    ['uniform_fit', 'uniform_hist', 'uniform_qq_bad', 'uniform_qq_good'],
    ['normal_fit', 'normal_hist', 'normal_qq_bad', 'normal_qq_good']
], figsize=(12, 6), gridspec_kw={'wspace': .4, 'hspace': .2})

for err_type, err in errors.items():
    reg = LinearRegression()
    y_values = 29 * x_values + 30 + err

    fit = reg.fit(x_values, y_values)
    predictions = fit.predict(x_values)
    residuals = predictions - y_values

    axd[f'{err_type}_fit'].scatter(x_values, y_values, s=10, alpha=.8)
    axd[f'{err_type}_fit'].plot(x_values, predictions)

    axd[f'{err_type}_hist'].hist(residuals, bins=20)

    # Passing the (n, 1) columnar array: qqplot treats each row separately -> vertical line
    qqplot(residuals, ax=axd[f'{err_type}_qq_bad'], line='q')
    # Passing a true 1d array produces the expected QQ plot
    qqplot(residuals[:, 0], ax=axd[f'{err_type}_qq_good'], line='q')

####
# Below is primarily for plot aesthetics, feel free to ignore

for label, ax in axd.items():
    ax.set_ylabel(None)
    ax.set_xlabel(None)

    if label.startswith('uniform'):
        ax.set_title(label.replace('uniform_', '').replace('_', ' '))

    if label.endswith('fit'):
        ax.set_ylabel(f'{label.replace("_fit", "")} error')

# Separator line between the uniform and normal rows
# (reuses the last `ax` left over from the loop above)
line = Line2D(
    [.05, .95], [1.04, 1.04],
    color='black',
    transform=blended_transform_factory(fig.transFigure, ax.transAxes),
)
fig.add_artist(line)

show()

