How do I tell R to remove an outlier when calculating correlation? I identified a potential outlier from a scatter plot, and am trying to compare correlation with and without this value. This is for an intro stats course; I am just playing with this data to start understanding correlation and outliers.
My data looks like this:
"Australia" 35.2 31794.13
"Austria" 29.1 33699.6
"Canada" 32.6 33375.5
"CzechRepublic" 25.4 20538.5
"Denmark" 24.7 33972.62
...
and so on, for 26 lines of data. I am trying to find the correlation of the first and second numbers.
I did read this question; however, I am only trying to remove a single point, not a percentage of points. Is there a command in R to do this?
You can't do that with the basic cor() function, but you can:
1. use a correlation function from one of the robust statistics packages, e.g. covRob() from package robust
2. use a winsorize() function, e.g. from robustHD, to treat your data
Here is a quick example for the 2nd approach:
R> set.seed(42)
R> x <- rnorm(100)
R> y <- rnorm(100)
R> cor(x,y) # correlation of two unrelated series: almost zero
[1] 0.0312798
Then we "contaminate" one point in each series with a big outlier:
R> x[50] <- y[50] <- 10
R> cor(x,y) # bigger correlation due to one bad data point
[1] 0.534996
So let's winsorize:
R> x <- robustHD::winsorize(x)
R> y <- robustHD::winsorize(y)
R> cor(x,y)
[1] 0.106519
and we're back down to a much weaker correlation.
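To make the idea behind winsorizing more concrete, here is a rough base-R sketch that clips values lying far from the median. The helper name `winsorize_sketch` and the cutoff `const = 2` are assumptions for illustration; this is not robustHD's implementation and its numbers need not match the output above.

```r
# Rough base-R sketch of winsorizing: clip values lying more than `const`
# robust standard deviations (MAD) away from the median.
# winsorize_sketch and const = 2 are illustrative assumptions, not robustHD's code.
winsorize_sketch <- function(v, const = 2) {
  centre <- median(v)
  spread <- mad(v)                      # robust estimate of scale
  lower  <- centre - const * spread
  upper  <- centre + const * spread
  pmin(pmax(v, lower), upper)           # pull extreme values back to the limits
}

set.seed(42)
x <- rnorm(100)
y <- rnorm(100)
x[50] <- y[50] <- 10                    # contaminate one point in each series

r_raw  <- cor(x, y)
r_wins <- cor(winsorize_sketch(x), winsorize_sketch(y))
print(c(raw = r_raw, winsorized = r_wins))
```

Note that winsorizing changes every value beyond the limits, not only the one point you spotted, so it answers a slightly different question than dropping a single observation.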
If you apply the same conditional expression to both vectors, you can exclude that "point".
cor( DF[2][ DF[2] > 100 ],   # 2nd column, keeping only rows where its value exceeds 100
     DF[3][ DF[2] > 100 ] )  # 3rd column, filtered by the same condition on the 2nd column
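As a sketch of the same idea on the five rows shown in the question, the condition can be stored once and reused for both columns. The column names `country`, `x`, `y` and the cutoff of 25000 are assumptions chosen only for illustration.

```r
# First five rows from the question; the column names are hypothetical.
DF <- data.frame(
  country = c("Australia", "Austria", "Canada", "CzechRepublic", "Denmark"),
  x = c(35.2, 29.1, 32.6, 25.4, 24.7),
  y = c(31794.13, 33699.6, 33375.5, 20538.5, 33972.62)
)

# Build the condition once so both vectors are filtered identically.
# The cutoff of 25000 is an arbitrary illustration, not a recommended rule.
keep <- DF$y > 25000

r_all  <- cor(DF$x, DF$y)
r_kept <- cor(DF$x[keep], DF$y[keep])
print(c(all = r_all, kept = r_kept))
```

Storing the mask in `keep` avoids the risk of typing two slightly different conditions and ending up with vectors of different lengths.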
In the following, I assume (reading between the lines) that you identified that single outlier visually, i.e., from a graph. With your limited data set, it's probably easy to identify that point based on its value. If you have more data points, you could use something like this.
tmp <- qqnorm(bi$bias.index)
qqline(bi$bias.index)
(X <- identify(tmp, , labels=rownames(bi)))
qqnorm(bi$bias.index[-X])
qqline(bi$bias.index[-X])
Note that I just copied my own code because you didn't provide sample code I could work from. Also check ?identify first.
It makes sense to put all your data in a data frame, so it's easier to handle. I always like to keep track of outliers by using an extra column (in this case, B) in my data frame.
df <- data.frame(A=c(1,2,3,4,5), B=c(TRUE,TRUE,TRUE,FALSE,TRUE))
And then filter out data I don't want before getting into the good analytical stuff.
myFilter <- with(df, B == TRUE)  # TRUE marks rows to keep
df[myFilter, ]
This way, you don't lose track of the outliers, and you are able to manage them as you see fit.
EDIT:
Improving upon my answer above, you could also use conditionals to define the outliers.
df <- data.frame(A=c(1,2,15,1,2))
df$B <- with(df, A > 2)   # here TRUE marks an outlier
subset(df, B == FALSE)    # keep only the non-outliers
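To connect the flag-column idea back to the correlation question, here is a minimal sketch. The toy values and the `x > 10` rule are made up for illustration and are not taken from the question's data.

```r
# Hypothetical toy data: x and y are made-up measurements.
df <- data.frame(
  x = c(1, 2, 3, 4, 50),
  y = c(2.1, 3.9, 6.2, 8.1, 1.0)
)

# Flag the suspect row instead of deleting it; the x > 10 rule is illustrative.
df$outlier <- df$x > 10

# Correlation on all rows versus only the rows not flagged as outliers.
r_with    <- with(df, cor(x, y))
r_without <- with(subset(df, !outlier), cor(x, y))
print(c(with = r_with, without = r_without))
```

Because the flagged row stays in the data frame, the comparison can be rerun at any time with a different flagging rule.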
You are getting some great and informative answers here, but they seem to be answers to more complex questions. Correct me if I'm wrong, but it sounds like you just want to remove a single observation by hand. Specifying the negative of its index will remove it.
Assuming your data frame is A and its columns are V1 and V2:
WithAus <- cor(A$V1,A$V2)
WithoutAus <- cor(A$V1[-1],A$V2[-1])
Or you can remove several indices at once, say 1, 5 and 20:
ToRemove <- c(-1,-5,-20)
WithAus <- cor(A$V1,A$V2)
WithoutAus <- cor(A$V1[ToRemove],A$V2[ToRemove])
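If counting rows by hand is error-prone, the index can be looked up by name instead. The sketch below reads the five rows shown in the question; the column names and the choice of CzechRepublic as the suspect point are assumptions for illustration only.

```r
# Read the first five rows shown in the question; column names are assumptions.
A <- read.table(text = '
"Australia" 35.2 31794.13
"Austria" 29.1 33699.6
"Canada" 32.6 33375.5
"CzechRepublic" 25.4 20538.5
"Denmark" 24.7 33972.62
', col.names = c("country", "V1", "V2"))

# Look up the row index by name rather than counting rows by hand.
# Treating CzechRepublic as the suspect point is purely illustrative.
drop <- which(A$country == "CzechRepublic")
stopifnot(length(drop) == 1)   # guard: a negative empty index would drop everything

WithAll <- cor(A$V1, A$V2)
Without <- cor(A$V1[-drop], A$V2[-drop])
print(c(with = WithAll, without = Without))
```

The guard matters because `x[-integer(0)]` returns an empty vector rather than the full one, so a misspelled name would otherwise fail in a confusing way.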