
How do I tell R to remove the outlier from a correlation calculation?

How do I tell R to remove an outlier when calculating correlation? I identified a potential outlier from a scatter plot, and am trying to compare correlation with and without this value. This is for an intro stats course; I am just playing with this data to start understanding correlation and outliers.

My data looks like this:

"Australia" 35.2 31794.13
"Austria" 29.1 33699.6
"Canada" 32.6 33375.5
"CzechRepublic" 25.4 20538.5
"Denmark" 24.7 33972.62
...

and so on, for 26 lines of data. I am trying to find the correlation of the first and second numbers.

I did read this question; however, I am only trying to remove a single point, not a percentage of points. Is there a command in R to do this?

You can't do that with the basic cor() function, but you can

  • use a correlation function from one of the robust statistics packages, e.g. covRob() from package robust (a short sketch follows this list)

  • use a winsorize() function, e.g. from robustHD, to treat your data
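For the first approach, here is a minimal sketch (assuming the robust package is installed; as I understand it, covRob() accepts a numeric matrix or data frame and, with corr = TRUE, returns a robust correlation estimate in its cov component). It reuses the same simulated data as the winsorize example below:

library(robust)                          # assumed installed; provides covRob()
set.seed(42)
x <- rnorm(100); y <- rnorm(100)         # two unrelated series
x[50] <- y[50] <- 10                     # the same single-point contamination as below
covRob(cbind(x, y), corr = TRUE)$cov     # robust correlation matrix; the outlier is downweighted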

Here is a quick example for the 2nd approach:

R> set.seed(42)
R> x <- rnorm(100)
R> y <- rnorm(100)
R> cor(x,y)             # correlation of two unrelated series: almost zero
[1] 0.0312798

Then we "contaminate" one point in each series with a big outlier:

R> x[50] <- y[50] <- 10
R> cor(x,y)             # bigger correlation due to one bad data point
[1] 0.534996

So let's winsorize:

R> x <- robustHD::winsorize(x)
R> y <- robustHD::winsorize(y)
R> cor(x,y)
[1] 0.106519
R> 

and we're back down to a less correlated measure.

If you apply the same conditional expression to both vectors, you can exclude that "point".

cor( DF[2][ DF[2] > 100 ],    # items in 2nd column excluded based on their own values
     DF[3][ DF[2] > 100 ] )   # items in 3rd column excluded based on the 2nd column's values
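Applied to the data in the question, the same idea might look like this (a sketch, assuming the 26 rows were read into a data frame DF with hypothetical column names country, x and gdp, and that Australia is the suspected outlier):

keep <- DF$country != "Australia"   # same logical condition applied to both vectors
cor(DF$x[keep], DF$gdp[keep])       # correlation without that single point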

In the following, I worked from the presumption (which I read between your lines) that you have identified that single outlier visually (i.e., from a graph). With your limited data set it's probably easy to identify that point based on its value. If you have more data points, you could use something like this.

tmp <- qqnorm(bi$bias.index)                  # normal QQ plot; returns the plotted coordinates
qqline(bi$bias.index)
(X <- identify(tmp, labels = rownames(bi)))   # click the outlier(s); X holds their indices
qqnorm(bi$bias.index[-X])                     # redraw the plot without the identified points
qqline(bi$bias.index[-X])

Note that I just copied my own code because I couldn't work from sample code from you. Also check ?identify first.
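Once identify() has returned the index (or indices) in X, the same negative indexing answers the original correlation question; a sketch, with x and y standing in for your two data columns:

cor(x[-X], y[-X])   # correlation excluding the interactively identified point(s)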

It makes sense to put all your data in a data frame, so it's easier to handle. I always like to keep track of outliers by using an extra column (in this case, B) in my data frame.

df       <-  data.frame(A=c(1,2,3,4,5), B=c(T,T,T,F,T))

And then filter out data I don't want before getting into the good analytical stuff.

myFilter <-  with(df, B==T)
df[myFilter, ]

This way, you don't lose track of the outliers, and you are able to manage them as you see fit.

EDIT:

Improving upon my answer above, you could also use conditionals to define the outliers.

df  <-  data.frame(A=c(1,2,15,1,2))
df$B<-  with(df, A > 2)
subset(df, B == F)
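Tying this back to the correlation question, a small sketch using the first five rows from your data (hypothetical column names x and gdp, and an illustrative flag condition) compares the two values directly:

df <- data.frame(x   = c(35.2, 29.1, 32.6, 25.4, 24.7),
                 gdp = c(31794.13, 33699.60, 33375.50, 20538.50, 33972.62))
df$outlier <- df$x > 35                    # flag the suspected point (Australia, the first row)
cor(df$x, df$gdp)                          # correlation with the outlier
with(subset(df, !outlier), cor(x, gdp))    # correlation without it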

You are getting some great and informative answers here, but they seem to be answers to more complex questions. Correct me if I'm wrong, but it sounds like you just want to remove a single observation by hand. Specifying the negative of its index will remove it.

Assuming your data frame is A and its columns are V1 and V2:

WithAus <- cor(A$V1, A$V2)
WithoutAus <- cor(A$V1[-1], A$V2[-1])

Or you can remove several indices at once. Let's say 1, 5 and 20:

ToRemove <- c(-1, -5, -20)
WithAus <- cor(A$V1, A$V2)
WithoutAus <- cor(A$V1[ToRemove], A$V2[ToRemove])
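If you would rather not hard-code the row numbers, you can look the index up first (a sketch assuming a hypothetical Country column holds the names from your file):

idx <- which(A$Country == "Australia")       # row index of the suspected outlier
WithoutAus <- cor(A$V1[-idx], A$V2[-idx])    # correlation without that row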
