
How to identify an outlier from a data set in R

I am new to R and am currently trying to identify outliers in a set of data. So far I have entered the following into R:

lifespan_yrs <- c(38.6, 4.5, 14, 8, 69, 27, 19, 30.4, 28, 50, 7, 30, 3.5,
  40, 3.5, 50, 6, 10.4, 34, 7, 28, 20, 3.9, 39.3, 41, 16.2, 9, 7.6, 46, 22.4,
  16.3, 2.6, 24, 100, 13, 10, 3.2, 2, 5, 6.5, 23.6, 12, 20.2, 13, 27, 18, 13.7,
  4.7, 9.8, 29, 7, 6, 17, 20, 12.7, 3.5, 4.5, 7.5, 2.3, 24, 3, 13)

gestation_days <- c(645, 42, 60, 25, 624, 180, 35, 392, 63, 230, 112, 281, 35,
  365, 42, 28, 42, 120, 75, 122, 400, 148, 16, 252, 310, 63, 28, 68, 336, 100, 33,
  21.5, 50, 267, 30, 45, 19, 30, 12, 120, 440, 140, 170, 17, 115, 31, 63, 21, 52,
  164, 225, 225, 150, 151, 90, 45, 60, 200, 46, 210, 14, 38)

lifespan_yrs

gestation_days

plot(gestation_days, lifespan_yrs)

This gives me a plot of the data. However, the next part of the question says "investigate this plot and discuss any data points that merit investigation". I am taking this to mean: are there any outliers in the data? I am not sure what definition of an outlier I can or should use. Is there a way in R to investigate the data points along these lines? Please use simple language to explain this as, again, I am new to R.

Thank you! Mollie x

Well, "outlier" means only "something which has low probability under an assumed model for the data". The simplest assumption is that the data are normally distributed. Low probability for normally distributed data means anything in the tails. In the tails means data that are more than a few (let's say two) standard deviations away from the mean.
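
To get a feel for how "low" that probability is, here is a small sketch using pnorm, the normal distribution function in base R. The helper name tail_prob is made up for this illustration, and the numbers only apply if the data really are normally distributed.

```r
# tail_prob is a hypothetical helper name, not a base R function.
# It returns the probability that a normally distributed value lies
# more than k standard deviations from the mean (both tails combined).
tail_prob <- function(k) {
  2 * pnorm(-k)
}

tail_prob(2)  # a little under 5% of values
tail_prob(3)  # well under 1% of values
```

So with a cutoff of two standard deviations you would expect roughly 1 point in 20 to be flagged even when nothing unusual is going on.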

So this leads to a pretty simple procedure. Calculate the mean via the R function mean and the standard deviation via sd. Then look at any points which are less than the mean minus twice the sd, or more than the mean plus twice the sd. For roughly symmetric data these will be a few points in the left tail and a few in the right tail; for skewed data, such as values that cannot be negative, they may all sit in one tail. Is there something interesting about these points? That's what your instructor is asking.
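
The procedure above could be sketched roughly as follows. The data are re-entered from the question so the block runs on its own; is_far_from_mean is a made-up helper name, and the cutoff k = 2 is just the "let's say two" choice from earlier, not a fixed rule.

```r
# Data copied from the question.
lifespan_yrs <- c(38.6, 4.5, 14, 8, 69, 27, 19, 30.4, 28, 50, 7, 30, 3.5,
  40, 3.5, 50, 6, 10.4, 34, 7, 28, 20, 3.9, 39.3, 41, 16.2, 9, 7.6, 46, 22.4,
  16.3, 2.6, 24, 100, 13, 10, 3.2, 2, 5, 6.5, 23.6, 12, 20.2, 13, 27, 18, 13.7,
  4.7, 9.8, 29, 7, 6, 17, 20, 12.7, 3.5, 4.5, 7.5, 2.3, 24, 3, 13)
gestation_days <- c(645, 42, 60, 25, 624, 180, 35, 392, 63, 230, 112, 281, 35,
  365, 42, 28, 42, 120, 75, 122, 400, 148, 16, 252, 310, 63, 28, 68, 336, 100, 33,
  21.5, 50, 267, 30, 45, 19, 30, 12, 120, 440, 140, 170, 17, 115, 31, 63, 21, 52,
  164, 225, 225, 150, 151, 90, 45, 60, 200, 46, 210, 14, 38)

# Hypothetical helper (not part of base R): TRUE where a value lies
# more than k standard deviations away from the mean of x.
is_far_from_mean <- function(x, k = 2) {
  abs(x - mean(x)) > k * sd(x)
}

flag_lifespan  <- is_far_from_mean(lifespan_yrs)
flag_gestation <- is_far_from_mean(gestation_days)

# A point merits a look if it is unusual on either variable.
flagged <- flag_lifespan | flag_gestation

# List the flagged points together with their position in the data.
data.frame(index          = which(flagged),
           gestation_days = gestation_days[flagged],
           lifespan_yrs   = lifespan_yrs[flagged])

# Redraw the scatter plot with the flagged points in red.
plot(gestation_days, lifespan_yrs,
     col = ifelse(flagged, "red", "black"), pch = 19)
```

The table tells you which rows to go back and look at, and the red points show where they sit relative to everything else in the plot.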

Of course, what counts as an outlier depends entirely on the model assumed for the data: if you change the model, you'll change the outliers. So it's important to spell out what your model is, and be prepared to change it if somebody (e.g. your instructor) suggests a different one.
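
As one illustration of how the choice of model matters, the sketch below applies the same two-sd rule to gestation_days twice: once assuming the raw values are normal, and once assuming their logarithms are normal. The log model is only an assumption picked for the example, not something the question or your instructor specified.

```r
# Data copied from the question.
gestation_days <- c(645, 42, 60, 25, 624, 180, 35, 392, 63, 230, 112, 281, 35,
  365, 42, 28, 42, 120, 75, 122, 400, 148, 16, 252, 310, 63, 28, 68, 336, 100, 33,
  21.5, 50, 267, 30, 45, 19, 30, 12, 120, 440, 140, 170, 17, 115, 31, 63, 21, 52,
  164, 225, 225, 150, 151, 90, 45, 60, 200, 46, 210, 14, 38)

# Model 1: the raw values are assumed to be normally distributed.
flag_raw <- abs(gestation_days - mean(gestation_days)) > 2 * sd(gestation_days)

# Model 2 (an assumption for illustration): the logs are normally distributed.
log_gestation <- log(gestation_days)
flag_log <- abs(log_gestation - mean(log_gestation)) > 2 * sd(log_gestation)

# Compare which values each model flags.
gestation_days[flag_raw]
gestation_days[flag_log]
```

The two lists need not agree, and any difference between them comes purely from the model, since the data are the same in both cases.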

