
How do I tell R to remove the outlier from a correlation calculation?

How do I tell R to remove an outlier when calculating correlation? I identified a potential outlier from a scatter plot and am trying to compare the correlation with and without this value. This is for an intro stats course; I am just playing with this data to start understanding correlation and outliers.

My data looks like this:

"Australia" 35.2 31794.13
"Austria" 29.1 33699.6
"Canada" 32.6 33375.5
"CzechRepublic" 25.4 20538.5
"Denmark" 24.7 33972.62
...

and so on, for 26 lines of data. I am trying to find the correlation between the first and second numbers on each line.

I did read this question; however, I am only trying to remove a single point, not a percentage of points. Is there a command in R to do this?

You can't do that with the basic cor() function, but you can:

  • use a correlation function from one of the robust statistics packages, e.g. covRob() from package robust

  • use a winsorize() function, e.g. from robustHD, to treat your data

Here is a quick example of the second approach:

R> set.seed(42)
R> x <- rnorm(100)
R> y <- rnorm(100)
R> cor(x,y)             # correlation of two unrelated series: almost zero
[1] 0.0312798

Then we "contaminate" one point in each series with a big outlier:

R> x[50] <- y[50] <- 10
R> cor(x,y)             # bigger correlation due to one bad data point
[1] 0.534996

So let's winsorize:

R> x <- robustHD::winsorize(x)
R> y <- robustHD::winsorize(y)
R> cor(x,y)
[1] 0.106519

and the correlation drops back down to a much lower value.
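If installing robustHD is not an option, the clipping idea can be approximated in base R. The following is a minimal sketch, not robustHD's implementation: clip_mad() is a hypothetical helper that pulls values back to within k scaled MADs of the median, and k = 2 is an assumed cutoff.

```r
# Hypothetical helper (not part of any package): clip values that lie more
# than k scaled MADs away from the median back to that boundary.
clip_mad <- function(v, k = 2) {
  centre <- median(v)
  spread <- mad(v)                 # scaled median absolute deviation
  pmin(pmax(v, centre - k * spread), centre + k * spread)
}

set.seed(42)
x <- rnorm(100)
y <- rnorm(100)
x[50] <- y[50] <- 10               # same contamination as in the example above

r_raw     <- cor(x, y)                       # inflated by the outlier
r_clipped <- cor(clip_mad(x), clip_mad(y))   # outlier pulled back in
```

Comparing r_raw and r_clipped shows how much of the apparent correlation came from the single contaminated point. Note that clipping changes every extreme value, not only the one you suspect.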

If you apply the same conditional expression to both vectors, you can exclude that "point".

keep <- DF[[2]] > 100              # TRUE for the rows to keep, based on the 2nd column
cor( DF[[2]][keep],                # 2nd column, filtered on its own values
     DF[[3]][keep] )               # 3rd column, filtered on the 2nd column's values
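As a self-contained illustration of the same idea, the sketch below builds a small made-up data frame and uses one logical mask for both columns. The names DF, id, x and y and the cutoff of 50 are assumptions for the example, not taken from the question.

```r
# Made-up data: the last row is an obvious outlier in x.
DF <- data.frame(
  id = c("a", "b", "c", "d", "e", "f"),
  x  = c(10, 12, 14, 16, 18, 90),
  y  = c(5, 6, 7, 8, 9, 1)
)

keep <- DF$x < 50                        # one mask, reused for both columns

r_all  <- cor(DF$x, DF$y)                # all six rows
r_kept <- cor(DF$x[keep], DF$y[keep])    # outlier row dropped
```

Using the same mask on both vectors matters: it keeps the x and y values paired row by row, which cor() relies on.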

In the following, I worked from the presumption (reading between the lines) that you have identified that single outlier visually (i.e., from a graph). With your limited data set it's probably easy to identify that point based on its value. If you have more data points, you could use something like this.

tmp <- qqnorm(bi$bias.index)
qqline(bi$bias.index)
(X <- identify(tmp, , labels=rownames(bi)))
qqnorm(bi$bias.index[-X])
qqline(bi$bias.index[-X])

Note that I just copied my own code because I had no sample code from you to work from. Also check ?identify beforehand.
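identify() needs an interactive graphics device, so it is of no use in a script. A rough non-interactive sketch follows, assuming the suspect point is simply the one with the largest absolute z-score. The vector v is made up for illustration.

```r
set.seed(1)
v <- c(rnorm(25), 8)              # 25 ordinary values plus one planted outlier

z <- (v - mean(v)) / sd(v)        # standardise
X <- which.max(abs(z))            # index of the most extreme value

v_clean <- v[-X]                  # negative index drops that element
```

This only flags the single most extreme value; it does not decide whether that value really is an outlier.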

It makes sense to put all your data in a data frame, so it's easier to handle. I always like to keep track of outliers by using an extra column (in this case, B) in my data frame.

df <- data.frame(A = c(1, 2, 3, 4, 5), B = c(TRUE, TRUE, TRUE, FALSE, TRUE))

And then I filter out the data I don't want before getting into the good analytical stuff.

myFilter <-  with(df, B==T)
df[myFilter, ]

This way, you don't lose track of the outliers, and you are able to manage them as you see fit.

EDIT:

Improving upon my answer above, you could also use conditionals to define the outliers.

df   <- data.frame(A = c(1, 2, 15, 1, 2))
df$B <- with(df, A > 2)       # here TRUE marks an outlier
subset(df, B == FALSE)        # keep only the non-outliers
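Carried over to the correlation question, the flag-column approach might look like the sketch below. The data frame dat, its columns and the rule x > 50 are invented for illustration.

```r
dat <- data.frame(
  x = c(1, 2, 3, 4, 5, 100),
  y = c(2, 4, 5, 4, 6, 0)
)
dat$outlier <- dat$x > 50          # flag the row instead of deleting it

r_with    <- with(dat, cor(x, y))                      # all rows
r_without <- with(subset(dat, !outlier), cor(x, y))    # flagged row left out
```

Because the flagged row stays in dat, both correlations can be recomputed at any time without reloading the data.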

You are getting some great and informative answers here, but they seem to be answers to more complex questions. Correct me if I'm wrong, but it sounds like you just want to remove a single observation by hand. Specifying the negative of its index will remove it.

Assuming your data frame is A and the columns are V1 and V2:

WithAus <- cor(A$V1, A$V2)
WithoutAus <- cor(A$V1[-1], A$V2[-1])

Or you can remove several indexes, say 1, 5 and 20:

ToRemove <- c(-1, -5, -20)
WithAus <- cor(A$V1, A$V2)
WithoutAus <- cor(A$V1[ToRemove], A$V2[ToRemove])
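Using only the five rows shown in the question (the remaining 21 are elided there, so the numbers this produces are not the asker's real result), a sketch of the full comparison could look like this. The column names country, V1 and V2 are assumptions, and dropping Australia simply follows the variable names used above; it is not a claim that Australia is the outlier.

```r
A <- data.frame(
  country = c("Australia", "Austria", "Canada", "CzechRepublic", "Denmark"),
  V1      = c(35.2, 29.1, 32.6, 25.4, 24.7),
  V2      = c(31794.13, 33699.6, 33375.5, 20538.5, 33972.62)
)

drop <- which(A$country == "Australia")   # look the row up by name

WithAus    <- cor(A$V1, A$V2)
WithoutAus <- cor(A$V1[-drop], A$V2[-drop])
```

Looking the row up by name avoids hard-coding its position. One caveat: if no row matches, drop is empty and the negative index selects nothing, so a logical mask such as A$country != "Australia" is the safer choice in scripts.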


 