Downsizing an lm object for plotting

I'd like to use check_model() from {performance}, but I'm working with a few million data points, which makes plotting too costly. Is it possible to take a sample from an lm() model without affecting everything else (e.g., its coefficients)?

# defining a model
model = lm(mpg ~ wt + am + gear + vs * cyl, data = mtcars)

# checking model assumptions
performance::check_model(model)

Created on 2022-08-23 by the reprex package (v2.0.1)

Alternative: is downsizing OK? In an ML workflow I'd downsample for tuning, feature selection, and feature engineering, for example. But I don't know whether that's usual in classic linear regression modelling (is it OK to test for heteroskedasticity in a downsized sample and then estimate the coefficients with the full sample?).

Speeding up check_model

The documentation (?check_model) explains a few things you can do to speed up the function/plotting without subsampling:

For models with many observations, or for more complex models in general, generating the plot might become very slow. One reason might be that the underlying graphic engine becomes slow for plotting many data points. In such cases, setting the argument show_dots = FALSE might help. Furthermore, look at the check argument and see if some of the model checks could be skipped, which also increases performance.

Accordingly, you can turn off the plotting of individual data points with check_model(model, show_dots = FALSE). You can also restrict the output to the checks you are interested in, which reduces computation time. For example, you could get only the posterior predictive check with check_model(model, check = "pp_check").
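The two options above can be combined in one call. The following is a minimal sketch, assuming {performance} and its plotting companion {see} are installed. The check names "linearity" and "homogeneity" are taken from the check argument described in ?check_model, and wanted_checks and diagnostics are illustrative names only.

```r
# Sketch only: needs the {performance} package, plus {see} for the plots.
model <- lm(mpg ~ wt + am + gear + vs * cyl, data = mtcars)

# Hypothetical selection of checks; see ?check_model for the full list.
wanted_checks <- c("linearity", "homogeneity")

if (requireNamespace("performance", quietly = TRUE) &&
    requireNamespace("see", quietly = TRUE)) {
  diagnostics <- performance::check_model(
    model,
    check = wanted_checks,  # skip every check not listed here
    show_dots = FALSE       # do not draw the individual data points
  )
  print(diagnostics)        # printing the returned object draws the panels
}
```

Whether this is fast enough for a few million rows depends on the checks selected, so it may be worth timing it on a smaller model first.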

Implications of Downsampling

Choosing a subset of observations (and/or draws from the posterior if you're using a Bayesian model) will always change the results of anything that conditions on the data. Both your model parameters and any post-estimation summaries that condition on the data will change. How much they change depends on the variability of your observations and on the sample size. With millions of observations, the change is unlikely to be large, though some rare data patterns could still heavily influence your results during estimation or post-estimation.
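To get a feel for how much the estimates move, one can refit the same formula on a random subset and compare coefficients. This is a sketch on simulated data, not the asker's data: the data frame sim, the sample sizes, and the true coefficients (1 and 2) are all assumptions made for illustration.

```r
set.seed(42)

# Simulated stand-in for a large data set (assumed, not from the question).
n <- 200000
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n)

# Fit on all rows, then on a random subset of 10,000 rows.
full_fit <- lm(y ~ x, data = sim)
sub_rows <- sample(n, 10000)
sub_fit  <- lm(y ~ x, data = sim[sub_rows, ])

# Absolute difference between the two sets of coefficient estimates.
coef_diff <- abs(coef(full_fit) - coef(sub_fit))
print(rbind(full = coef(full_fit), subsample = coef(sub_fit)))
print(coef_diff)
```

With well-behaved simulated data the two fits agree closely. Real data with rare but influential patterns may not be this forgiving, which is the caveat raised above.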

Plotting for heteroskedasticity based on a different model than the one you estimated makes little sense, though your mileage may vary because the two models may differ little. You're looking to evaluate how well your model approximates the Gauss-Markov variance assumptions, not how well another model does. From a computational perspective it would also be puzzling: the residuals are part of estimation, so if you can fit the model, you can presumably also show the residuals in various ways.
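One way to keep the diagnostics tied to the model you actually estimated is to compute fitted values and residuals from the full fit and thin only the points that get drawn. The sketch below uses base R rather than check_model(), and the simulated data, the sample size of 5,000, and the names diag_df and plot_df are assumptions for illustration.

```r
set.seed(1)

# Simulated data whose error spread grows with x (heteroskedastic by design).
n <- 200000
sim <- data.frame(x = runif(n))
sim$y <- 1 + 2 * sim$x + rnorm(n, sd = 0.5 + sim$x)

# The model is fitted once, on every row; its coefficients are untouched.
full_fit <- lm(y ~ x, data = sim)

# Diagnostics come from the full fit ...
diag_df <- data.frame(
  fitted = fitted(full_fit),
  resid  = rstandard(full_fit)  # standardized residuals
)

# ... and only the plotted points are a random subset.
plot_rows <- sample(nrow(diag_df), 5000)
plot_df   <- diag_df[plot_rows, ]

# Scale-location style plot with a lowess trend line.
plot(plot_df$fitted, sqrt(abs(plot_df$resid)),
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)",
     pch = 16, cex = 0.4)
lines(lowess(plot_df$fitted, sqrt(abs(plot_df$resid))), col = "red", lwd = 2)
```

Because only the drawing step is subsampled, the residuals still belong to the full model, so the plot approximates that model's diagnostics rather than those of a second model fitted to less data.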

That being said, these plots are themselves only approximations to the actual distribution of interest (i.e., you're implicitly estimating test statistics with some of these plots), and since the central limit theorem applies, things would look roughly the same if you cut out some observations, provided your data are sufficiently large.
