
How to generate OUTLIER-FREE data in R?

I would like to know how to generate outlier-free data in R. I'm generating the data with rnorm().

Say I have a linear equation

   Y = B0 + B1*X + E,     where X~N(5,9) and E~N(0,1).

I'm going to use rnorm() to generate X and E. Here is the code I used:

  X <- rnorm(50, 5, 3)     # 50 X values with mean 5 and sd 3 (variance 9)
  E <- rnorm(50, 0, 1)     # 50 error terms with mean 0 and sd 1 (variance 1)

Next, I generate Y by plugging the generated X and E into the linear equation.

If the data generated above are outlier-free (no influential observations), then no observation should have a Cook's distance above 4/n, which is a commonly used cutoff for flagging influential or outlying observations.
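For concreteness, the check described above can be sketched as follows. The coefficient values B0 = 1 and B1 = 2 and the seed are assumptions made only for illustration; the question does not specify them.

```r
set.seed(1)                     # assumed seed, only for reproducibility
n  <- 50
B0 <- 1                         # hypothetical intercept (not given in the question)
B1 <- 2                         # hypothetical slope (not given in the question)

X <- rnorm(n, 5, 3)             # X ~ N(5, 9), i.e. sd = 3
E <- rnorm(n, 0, 1)             # E ~ N(0, 1)
Y <- B0 + B1 * X + E

fit <- lm(Y ~ X)
cd  <- cooks.distance(fit)      # one Cook's distance per observation

flagged <- which(cd > 4 / n)    # indices above the 4/n cutoff
length(flagged)                 # number of flagged observations
```

With purely random normal draws, a few observations will usually land above the cutoff, which is the behaviour described in this question.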

But I have not been able to achieve this so far. I still get outliers whenever I generate data with this procedure.

Can you help me out with this? Do you know a way to generate data that is outlier-free?

Thanks a lot!

Well, one way would be to detect and delete those outliers by finding the generated points that exceed some cutoff. Of course, this harms the "randomness" of your generated data, but a request for outlier-free data implies that by definition. Possibly, decreasing the variance of X could also help.
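A minimal sketch of the detect-and-delete idea, assuming the 4/n Cook's distance cutoff from the question and hypothetical coefficients B0 = 1 and B1 = 2. Because Cook's distance is computed relative to the current fit, dropping points can push other points over the cutoff, so the sketch refits and repeats until nothing is flagged. The floor min_n is an arbitrary safeguard, not part of the original suggestion.

```r
set.seed(2)                          # assumed seed, only for reproducibility
n  <- 50
B0 <- 1                              # hypothetical intercept
B1 <- 2                              # hypothetical slope

d   <- data.frame(X = rnorm(n, 5, 3))
d$Y <- B0 + B1 * d$X + rnorm(n, 0, 1)

min_n <- 10                          # arbitrary floor so the loop cannot empty the data

repeat {
  fit  <- lm(Y ~ X, data = d)
  keep <- cooks.distance(fit) <= 4 / nrow(d)   # cutoff uses the current sample size
  # Stop when nothing is flagged, or when deleting would go below the floor.
  if (all(keep) || sum(keep) < min_n) break
  d <- d[keep, , drop = FALSE]       # delete flagged points, then refit
}

nrow(d)                              # observations left after trimming
```

The trimmed sample is smaller than n, so if exactly 50 observations are needed, one option is to start from a larger sample.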

Is there a particular reason you need the X's to be normally distributed? The assumption of normality in regression is for the residuals (the error term). Typically the measured independent variable won't be normally distributed; in a balanced, (quasi-)experimental setup, the X's should be close to uniformly distributed. A uniform distribution for the X's (or even an evenly spaced sequence generated with seq()) would help you here, because the "outlierness" of outliers arises from both being far from the center of the sample space and being comparatively few in number. With a uniform distribution, points far from the center are no longer few in number, which reduces their leverage.
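To illustrate the leverage argument, here is a rough sketch comparing the largest hat value (leverage) for normal, uniform, and evenly spaced X's. The uniform range 5 ± 3*sqrt(3) is my own choice so that the mean and variance match N(5, 9); the coefficients and the seed are likewise assumptions.

```r
set.seed(3)                          # assumed seed, only for reproducibility
n  <- 50
B0 <- 1                              # hypothetical intercept
B1 <- 2                              # hypothetical slope
half_width <- 3 * sqrt(3)            # uniform on 5 +/- 3*sqrt(3) has variance 9

designs <- list(
  normal  = rnorm(n, 5, 3),
  uniform = runif(n, 5 - half_width, 5 + half_width),
  grid    = seq(5 - half_width, 5 + half_width, length.out = n)
)

# Largest leverage (hat value) in a simple regression on the given X's.
max_leverage <- function(X) {
  Y <- B0 + B1 * X + rnorm(length(X), 0, 1)
  max(hatvalues(lm(Y ~ X)))
}

lev <- sapply(designs, max_leverage)
lev
```

Leverage depends only on the X's, so this isolates the effect of the design. Observations with large residuals can still exceed the 4/n cutoff even when leverage is low.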

As a sidebar: real data have outliers. This is actually one of the ways we can detect touched-up or even faked data in science. If you're interested in simulations that correspond to something in reality, then outliers may not be a bad thing. And there is a whole world of robust methods for dealing with data that contain arbitrarily bad outliers in a principled way, as opposed to relying on arbitrary cutoff points.
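As one example of such a robust method, here is a small sketch using rlm() from the MASS package (M-estimation with Huber weights by default) next to ordinary least squares. The contamination step, which shifts the last five responses by 30, and the coefficients are invented purely for illustration.

```r
library(MASS)                        # provides rlm(); a recommended package in standard R installs

set.seed(4)                          # assumed seed, only for reproducibility
n  <- 50
B0 <- 1                              # hypothetical intercept
B1 <- 2                              # hypothetical slope

X <- seq(0, 10, length.out = n)
Y <- B0 + B1 * X + rnorm(n, 0, 1)
Y[46:50] <- Y[46:50] + 30            # hypothetical contamination: five gross outliers

ols <- lm(Y ~ X)                     # least squares, pulled toward the outliers
rob <- rlm(Y ~ X)                    # robust fit, downweights large residuals

rbind(ols = coef(ols), robust = coef(rob))
```

The robust fit keeps every observation and downweights the suspicious ones instead of deleting them at a fixed cutoff.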
