
Downsizing an lm object for plotting

I'd like to use check_model() from {performance}, but I'm working with a few million data points, which makes plotting too costly. Is it possible to take a sample from an lm() model without affecting everything else (e.g., its coefficients)?

# defining a model
model = lm(mpg ~ wt + am + gear + vs * cyl, data = mtcars)

# checking model assumptions
performance::check_model(model)

Created on 2022-08-23 by the reprex package (v2.0.1)

Alternative: is downsizing OK? In an ML workflow I'd downsample for tuning, feature selection, and feature engineering, for example. But I don't know whether that's usual in classic linear regression modelling. (Is it OK to test for heteroskedasticity in a downsized sample and then estimate the coefficients with the full sample?)

Speeding up check_model

The documentation (?check_model) explains a few things you can do to speed up the function and its plotting without subsampling:

For models with many observations, or for more complex models in general, generating the plot might become very slow. One reason might be that the underlying graphic engine becomes slow for plotting many data points. In such cases, setting the argument show_dots = FALSE might help. Furthermore, look at the check argument and see if some of the model checks could be skipped, which also increases performance.

Accordingly, you can stop the individual data points from being drawn with check_model(model, show_dots = FALSE). You can also request only the specific checks you are interested in, which reduces computation time. For example, you could get only the posterior predictive check with check_model(model, check = "pp_check").
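The two options above can be combined. The following is a minimal sketch on the question's mtcars model, not a benchmark; it assumes {performance} and its plotting companion {see} are installed, and the choice of "linearity" and "homogeneity" as the checks to keep is only an illustration.

```r
# Assumption: {performance} and {see} (used for plotting) are installed.
model <- lm(mpg ~ wt + am + gear + vs * cyl, data = mtcars)

if (requireNamespace("performance", quietly = TRUE) &&
    requireNamespace("see", quietly = TRUE)) {
  # All default checks, but without drawing the individual data points
  print(performance::check_model(model, show_dots = FALSE))

  # Only the residual-based checks for linearity and constant variance
  print(performance::check_model(
    model,
    check = c("linearity", "homogeneity"),
    show_dots = FALSE
  ))
}
```

Neither call refits the model, so the coefficients of model are untouched.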

Implications of Downsampling

Choosing a subset of observations (and/or draws from the posterior if you're using a Bayesian model) will always change the results of anything that conditions on the data. Both your model parameters and any post-estimation summaries that condition on the data will change. How much they change depends on the variability of your observations and on the sample size. With millions of observations the change is probably small, but rare data patterns can still heavily influence your results during estimation and post-estimation.
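To get a feel for the size of that change, here is a small sketch on simulated data. The data-generating process, the sample sizes, and the names dat, full_fit and sub_fit are all made up for illustration; real data with rare patterns may behave differently.

```r
set.seed(1)

# Simulated data (an assumption for illustration, not the asker's data)
n   <- 1e5
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

# Same formula, fitted once on all rows and once on a random subset
full_fit <- lm(y ~ x, data = dat)
sub_fit  <- lm(y ~ x, data = dat[sample(n, 5000), ])

# The estimates are close but not identical
rbind(full = coef(full_fit), subsample = coef(sub_fit))
```

The two rows of coefficients differ slightly, which is the point: the subsample fit is a different model from the one estimated on the full data.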

Checking for heteroskedasticity with a different model than the one you estimated makes little sense, although in practice the two models may differ so little that it does not matter. You want to evaluate how well your model approximates the Gauss-Markov variance assumptions, not how well another model does. From a computational perspective it would also be puzzling: the residuals are a by-product of estimation, so if you can fit the model, you can presumably also show its residuals in various ways.
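One way to act on this is to keep the full-sample fit and thin only the points that get drawn. The sketch below uses base R rather than check_model(); the simulated heteroskedastic data and the names full_fit, diag_df and idx are assumptions for illustration.

```r
set.seed(1)

# Simulated data whose error variance grows with x (illustration only)
n <- 1e5
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)

# Fit once, on every observation
full_fit <- lm(y ~ x)

# Residuals and fitted values come from the full model ...
diag_df <- data.frame(fitted = fitted(full_fit), resid = resid(full_fit))

# ... and only the plotted points are subsampled
idx <- sample(nrow(diag_df), 5000)
plot(diag_df$fitted[idx], diag_df$resid[idx], pch = ".",
     xlab = "Fitted values", ylab = "Residuals",
     main = "Full-model residuals, 5000 random points")
abline(h = 0, lty = 2)
```

The coefficients and residuals are those of the full model; the subsample affects only what the graphics device has to draw.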

That being said, these plots are themselves only approximations to the actual distribution of interest (i.e., with some of them you're implicitly estimating test statistics). Since the central limit theorem applies, things would look roughly the same if you cut out some observations, provided your data set is sufficiently large.
