How do I interpret this plot and summary (multivariable linear regression)

I am not 100% sure how to interpret the plots for my multivariable linear regression, especially everything besides the normal QQ plot.

From my understanding, the plots show linearity, or that the model is a good fit.

[Image: multiple linear regression]

As for the summary, I think it shows some pretty good results based on the R^2 and adjusted R^2, alongside the F-statistic and the t/p-values.

[Image: regression model summary]

The Plots

First, your plots...

[Image: the asker's diagnostic plots]

The first plot (top left) is your residuals vs fitted plot. It shows your fitted values (what the regression predicts your value should be) against your residuals (how far off each prediction was). The points should be fairly evenly distributed around the center line; otherwise there may be issues with equality of variance or curvilinearity. In your plot, most of the data is smooshed into the left side, which hints that your data is not evenly spread out on a scatter plot.
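To make "fitted" and "residual" concrete, here is a minimal sketch in plain Python (used only for illustration). The data are made up, the helper name `fit_line` is hypothetical, and it uses a single predictor to keep the arithmetic short; the idea is the same with several predictors.

```python
# Hypothetical toy data, not the asker's data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_line(x, y)

# Fitted value: what the regression predicts for each row.
fitted = [intercept + slope * xi for xi in x]

# Residual: observed value minus fitted value.
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# These pairs are what a residuals vs fitted plot draws.
for f, r in zip(fitted, residuals):
    print(f"fitted={f:.2f}  residual={r:+.2f}")
```

With an intercept in the model the residuals always sum to (numerically) zero, which is why the plot is judged against a horizontal line at zero.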

The second plot (top right), your scale-location plot, is slightly different, as the y axis now uses standardized residuals (strictly, the usual version of this plot shows the square root of their absolute values). Because they are standardized, you can see whether the spread of the residuals changes with location along the fitted values. The red line should be as horizontal as possible, and the points should again be as evenly distributed as possible. Your plot seems to indicate that this isn't the case here either.
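As a rough sketch of what "standardized" means here, the plain-Python snippet below divides each residual by its estimated standard deviation and then takes the square root of the absolute value. The data are hypothetical, and the closed-form leverage used is the one for a single predictor, so treat this as an illustration under those assumptions rather than a general implementation.

```python
import math

# Hypothetical toy data with one predictor (not the asker's data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Ordinary least-squares fit.
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Residual standard error: two coefficients were estimated, hence n - 2.
sigma = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Leverage of each point (closed form for a single predictor).
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Standardized residual: residual divided by its estimated standard deviation.
std_resid = [e / (sigma * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]

# The y value typically drawn in a scale-location plot.
scale_loc = [math.sqrt(abs(r)) for r in std_resid]

for r, s in zip(std_resid, scale_loc):
    print(f"standardized={r:+.2f}  sqrt(|standardized|)={s:.2f}")
```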

The third plot (bottom left), the QQ plot, checks whether your residuals are normally distributed by plotting the standardized residuals against theoretical quantiles. The plotted points should mostly resemble a straight line, with only minor curvature at the ends. It's hard to tell with certainty since the plots are kinda squished together into one window. However, the residuals look mostly normal, with a slight curve on the left (not an issue) and a heavier curve on the right (heavy tails may indicate issues with variation on the right side of your scatterplot). To see if this is really damning, run a density plot on your raw residuals and see if they look normal.
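The pairing behind a QQ plot can be sketched in a few lines of plain Python. The residual values below are hypothetical, and the plotting positions `(i - 0.5) / n` are just one common convention; statistical software may use a slightly different one.

```python
from statistics import NormalDist

# Hypothetical standardized residuals, for illustration only.
std_resid = [0.4, -1.2, 0.1, 2.3, -0.6, 0.9, -0.2]

n = len(std_resid)

# Sample quantiles: simply the residuals in increasing order.
sample_q = sorted(std_resid)

# Plotting positions, then the matching quantiles of a standard normal.
probs = [(i - 0.5) / n for i in range(1, n + 1)]
theoretical_q = [NormalDist().inv_cdf(p) for p in probs]

# Each (theoretical, sample) pair is one point on the QQ plot.
for t, s in zip(theoretical_q, sample_q):
    print(f"theoretical={t:+.2f}  sample={s:+.2f}")
```

If the residuals are roughly normal, the sample values track the theoretical ones along a straight line; a largest residual that sits well above its theoretical quantile is what a heavy right tail looks like.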

The last plot (bottom right), your residuals vs leverage plot, checks whether any points in your regression have enough leverage to be potential outliers that distort the fit. People have suggested different cutoffs for what counts as "too high" (greater than 1, 4/n, etc.; these particular rules of thumb are usually quoted for Cook's distance). It's best to simply check whether some points sit way too far from the others and whether they are causing problematic trends.
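For a feel of how leverage and Cook's distance behave, here is a minimal plain-Python sketch. The data are made up, with the last row deliberately placed far from the others in x, and the leverage formula is the single-predictor closed form, so this is an illustration under those assumptions only.

```python
# Hypothetical toy data; the last point sits far from the others in x.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 6.0]
n = len(x)
p = 2  # coefficients estimated: intercept and one slope

# Ordinary least-squares fit.
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Mean squared error of the fit.
mse = sum(e ** 2 for e in residuals) / (n - p)

# Leverage: how unusual each row's predictor value is.
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Cook's distance: combines residual size and leverage into one influence score.
cooks_d = [
    (e ** 2 / (p * mse)) * h / (1 - h) ** 2
    for e, h in zip(residuals, leverage)
]

# Rows are printed 1-based, like the labels on the diagnostic plots.
for row, (h, d) in enumerate(zip(leverage, cooks_d), start=1):
    print(f"row {row}: leverage={h:.2f}  Cook's D={d:.2f}")
```

The leverages always add up to the number of estimated coefficients, so one row holding most of that total (row 5 here) is a sign that it has a lot of pull on the fitted line.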

By the way, the numbers shown next to points in these plots are their row indexes, so you can check them directly. For example, the topmost point in the first plot is in row 49.

For comparison, here are some residuals from Karl Pearson's original father-son height data, which has fairly normal diagnostic plots. Notice that the order is slightly different, but the interpretation is the same:

[Image: diagnostic plots for Pearson's father-son height data]

The Summary

The first part is the formula call, which just restates how you specified the regression. The second part shows how your residuals are distributed. Think of the minimum as the point that strays furthest below your regression line and the maximum as the point furthest above it. Your median should be as close to zero as possible; it doesn't need to be exactly zero, so long as it's small relative to the spread of the residuals and not some weird number.
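The residual block of the summary is just a five-number summary, which can be sketched in plain Python as below. The residuals are hypothetical, and quartile conventions differ slightly between tools, so the `"inclusive"` method used here is an assumption rather than a guaranteed match for your software's output.

```python
from statistics import quantiles

# Hypothetical residuals, for illustration only.
residuals = [-3.0, -1.0, 0.0, 1.5, 2.5]

# Quartiles; interpolation rules differ between tools.
q1, median, q3 = quantiles(residuals, n=4, method="inclusive")

summary = {
    "Min": min(residuals),   # furthest below the regression line
    "1Q": q1,
    "Median": median,        # ideally close to zero
    "3Q": q3,
    "Max": max(residuals),   # furthest above the regression line
}
print(summary)
```

A quick sanity check is whether the minimum and maximum, and the first and third quartiles, are roughly mirror images of each other around zero.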

The coefficients table shows your intercept and each of your predictors. Listed in order next to them are 1) the estimate (the slope, for a predictor), which is the number each predictor's raw value gets multiplied by in the linear regression equation, 2) the standard error, which measures how precisely that estimate is pinned down, 3) the t value (the estimate divided by its standard error), which is used to test significance, and 4) the p value, which is used as your significance "flag". All of your slope coefficients are significant, though not knowing what these predictors mean makes it difficult to interpret them with confidence.
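To show where the first three columns come from, here is a minimal plain-Python sketch for a regression with a single predictor. The data are made up and the standard-error formulas are the single-predictor ones; the p value is left out because it needs a t-distribution function that the Python standard library does not provide.

```python
import math

# Hypothetical toy data with one predictor (not the asker's data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Estimates: least-squares intercept and slope.
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

# Residual standard error (n - 2 degrees of freedom).
rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
sigma = math.sqrt(rss / (n - 2))

# Standard errors of the two estimates.
se_slope = sigma / math.sqrt(sxx)
se_intercept = sigma * math.sqrt(1 / n + x_bar ** 2 / sxx)

# t value: estimate divided by its standard error.
t_slope = slope / se_slope
t_intercept = intercept / se_intercept

print(f"(Intercept)  est={intercept:.3f}  se={se_intercept:.3f}  t={t_intercept:.2f}")
print(f"x            est={slope:.3f}  se={se_slope:.3f}  t={t_slope:.2f}")
```

A large absolute t value means the estimate is many standard errors away from zero, which is what produces a small p value in the last column.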

Below are some model metrics you seem to already know about. Remember that when you have multiple predictors, the adjusted R squared should be given more weight, because it penalizes your regression for overfitting with too many predictors, whereas the plain R squared never decreases when you add predictors. The F statistic and the values with it are used to test whether the model as a whole is significant, and the residual standard error gives a rough idea of how far off the model's predictions typically are.
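The penalty in the adjusted R squared is easy to see with the standard formulas, sketched here in plain Python. The function names and the numbers plugged in (R squared of 0.80, 50 rows, 3 or 10 predictors) are hypothetical, chosen only to show the effect.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R squared for n rows and p predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def f_statistic(r2, n, p):
    """Overall F statistic of the regression, written in terms of R squared."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


# Hypothetical values: same R squared, different numbers of predictors.
r2, n = 0.80, 50
for p in (3, 10):
    print(
        f"p={p:2d}  R2={r2:.3f}  "
        f"adjusted R2={adjusted_r2(r2, n, p):.3f}  "
        f"F={f_statistic(r2, n, p):.1f}"
    )
```

With the same R squared, the model that needed 10 predictors gets a lower adjusted R squared than the one that needed 3, which is the overfitting penalty at work.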
