How to identify an outlier from a data set in R

I am new to R and am currently trying to identify outliers in a set of data. So far I have entered the following into R:

lifespan_yrs <- c(38.6, 4.5, 14, 8, 69, 27, 19, 30.4, 28, 50, 7, 30, 3.5,
  40, 3.5, 50, 6, 10.4, 34, 7, 28, 20, 3.9, 39.3, 41, 16.2, 9, 7.6, 46, 22.4,
  16.3, 2.6, 24, 100, 13, 10, 3.2, 2, 5, 6.5, 23.6, 12, 20.2, 13, 27, 18, 13.7,
  4.7, 9.8, 29, 7, 6, 17, 20, 12.7, 3.5, 4.5, 7.5, 2.3, 24, 3, 13)

gestation_days <- c(645, 42, 60, 25, 624, 180, 35, 392, 63, 230, 112, 281, 35,
  365, 42, 28, 42, 120, 75, 122, 400, 148, 16, 252, 310, 63, 28, 68, 336, 100, 33,
  21.5, 50, 267, 30, 45, 19, 30, 12, 120, 440, 140, 170, 17, 115, 31, 63, 21, 52,
  164, 225, 225, 150, 151, 90, 45, 60, 200, 46, 210, 14, 38)

lifespan_yrs

gestation_days

plot(gestation_days,lifespan_yrs)

I now have a plot of this data. However, the next part of the question says "investigate this plot and discuss any data points that merit investigation". I take this to mean: are there any outliers in the data? I am not sure what definition of an outlier I can or should use. Is there a way in R to investigate the data points like this? Please use simple language to explain, since, again, I am new to R.

Thank you! Mollie x

Well, "outlier" only means "something which has low probability under an assumed model for the data". The simplest assumption is that the data are normally distributed. For normally distributed data, low probability means anything in the tails, that is, data that are more than a few (let's say two) standard deviations away from the mean.

So this leads to a pretty simple procedure. Calculate the mean with the R function mean and the standard deviation with sd. Then look at any points that are less than the mean minus twice the sd, or more than the mean plus twice the sd. For roughly normal data, these will be a few points in the left tail and a few in the right tail. Is there something interesting about these data points? That's what your instructor is asking.
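The procedure above could be sketched roughly as follows. This is only a minimal illustration, assuming the two vectors from the question; flag_outliers is a made-up helper name (it is not a base R function), and the cutoff of two standard deviations is the one suggested above, not a fixed rule.

```r
# Data copied from the question
lifespan_yrs <- c(38.6, 4.5, 14, 8, 69, 27, 19, 30.4, 28, 50, 7, 30, 3.5,
  40, 3.5, 50, 6, 10.4, 34, 7, 28, 20, 3.9, 39.3, 41, 16.2, 9, 7.6, 46, 22.4,
  16.3, 2.6, 24, 100, 13, 10, 3.2, 2, 5, 6.5, 23.6, 12, 20.2, 13, 27, 18, 13.7,
  4.7, 9.8, 29, 7, 6, 17, 20, 12.7, 3.5, 4.5, 7.5, 2.3, 24, 3, 13)
gestation_days <- c(645, 42, 60, 25, 624, 180, 35, 392, 63, 230, 112, 281, 35,
  365, 42, 28, 42, 120, 75, 122, 400, 148, 16, 252, 310, 63, 28, 68, 336, 100, 33,
  21.5, 50, 267, 30, 45, 19, 30, 12, 120, 440, 140, 170, 17, 115, 31, 63, 21, 52,
  164, 225, 225, 150, 151, 90, 45, 60, 200, 46, 210, 14, 38)

# Hypothetical helper (not part of base R): TRUE for every value that lies
# more than k standard deviations away from the mean
flag_outliers <- function(x, k = 2) {
  abs(x - mean(x)) > k * sd(x)
}

life_out <- flag_outliers(lifespan_yrs)
gest_out <- flag_outliers(gestation_days)
either   <- life_out | gest_out

# List the points flagged on at least one of the two variables
print(data.frame(gestation_days, lifespan_yrs)[either, ])

# Redraw the scatter plot and mark the flagged points in red
plot(gestation_days, lifespan_yrs)
points(gestation_days[either], lifespan_yrs[either], col = "red", pch = 19)
```

On this data the lifespans of 69 and 100 years are above the upper cutoff. The lower cutoff for lifespan comes out negative, so no point can land in the left tail here, which is itself a hint that a normal model may not fit these data very well.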

Of course, what counts as an outlier depends entirely on the model assumed for the data: if you change the model, you'll change the outliers. So it's important to spell out what your model is, and be prepared to change it if somebody (e.g. your instructor) suggests a different one.
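To illustrate how the flagged points depend on the model, here is a small sketch that applies the same two-sd rule twice: once assuming the lifespans are normal, and once assuming their logarithms are normal. The log-normal model is only an example of "a different model" chosen for illustration, and flag_outliers is again a made-up helper, not a base R function.

```r
# Lifespan data copied from the question
lifespan_yrs <- c(38.6, 4.5, 14, 8, 69, 27, 19, 30.4, 28, 50, 7, 30, 3.5,
  40, 3.5, 50, 6, 10.4, 34, 7, 28, 20, 3.9, 39.3, 41, 16.2, 9, 7.6, 46, 22.4,
  16.3, 2.6, 24, 100, 13, 10, 3.2, 2, 5, 6.5, 23.6, 12, 20.2, 13, 27, 18, 13.7,
  4.7, 9.8, 29, 7, 6, 17, 20, 12.7, 3.5, 4.5, 7.5, 2.3, 24, 3, 13)

# Hypothetical helper (not part of base R): values more than k sd from the mean
flag_outliers <- function(x, k = 2) abs(x - mean(x)) > k * sd(x)

raw_out <- flag_outliers(lifespan_yrs)       # model: lifespans are normal
log_out <- flag_outliers(log(lifespan_yrs))  # model: log(lifespan) is normal

# Compare which lifespans each model flags
print(lifespan_yrs[raw_out])
print(lifespan_yrs[log_out])
```

With these numbers the 69-year lifespan is flagged under the first model but not under the second, while the 100-year lifespan is flagged under both, so the answer to "which points are outliers?" really does change with the assumption.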
