
Confusion about calculating sample correlation in R

I have been tasked with manually calculating the sample correlation between two datasets (D$Nload and D$Pload) and then comparing the result with R's built-in cor() function.

I calculate the sample correlation with

cov(D$Nload,D$Pload, use="complete.obs")/(sd(D$Nload)*sd(D$Pload, na.rm=TRUE))

Which gives me the result 0.5693599

Then I try using R's cor() function

cor(D[, c("Nload","Pload")], use="pairwise.complete.obs")

which gives me the result:

          Nload     Pload
Nload 1.0000000 0.6244952
Pload 0.6244952 1.0000000

Which is a different result. Can anyone see where I've gone wrong?

This happens because sd(), when called on a single vector, has no way of knowing which observations belong to complete pairs. Example:

x <- rnorm(100)
y <- rexp(100)
y[1] <- NA
df <- data.frame(x = x, y = y)

So here we have

df[seq(2), ]
           x         y
1  1.0879645        NA
2 -0.3919369 0.2191193

The second row is pairwise complete (none of the columns used in the computation is NA), but the first row is not. When you call sd() on a single column, it has no information about the other column of the pair. So sd(df$x) uses all 100 values of x, including the first row, even though that row should be dropped because its y is missing. cov() with use = "complete.obs" does drop it, so the numerator and the denominator are computed on different sets of observations.
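To make the mismatch concrete, here is a minimal sketch that counts the observations each approach uses and compares the two standard deviations of x. The seed is an assumption added only for reproducibility, so the numbers will not match the output shown in this answer.

```r
# Same construction as the example data: one missing value, in y only.
set.seed(1)  # assumption: any seed works, this is only for reproducibility
x <- rnorm(100)
y <- rexp(100)
y[1] <- NA
df <- data.frame(x = x, y = y)

ok <- complete.cases(df)    # TRUE for rows where both x and y are present
n_all   <- length(df$x)     # number of values sd(df$x) uses
n_pairs <- sum(ok)          # number of complete pairs cor() uses

sd_all   <- sd(df$x)        # includes row 1, whose y is missing
sd_pairs <- sd(df$x[ok])    # restricted to the complete pairs
```

Here sd_all and sd_pairs differ slightly, and that difference in the denominator is what moves the manual result away from cor().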

cov(df$x, df$y, use = "complete.obs") / (sd(df$x)*sd(df$y, na.rm=TRUE))
[1] 0.09301583

cor(df$x, df$y, use = "pairwise.complete.obs")
[1] 0.09313766

But if you remove the first row before the computation, the two results are equal:

df <- df[complete.cases(df), ]
cov(df$x, df$y, use = "complete.obs") / (sd(df$x)*sd(df$y, na.rm=TRUE))
[1] 0.09313766
cor(df$x, df$y, use = "pairwise.complete.obs")
[1] 0.09313766
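Applied to the question, the same idea might look like the sketch below. The helper name manual_cor is hypothetical, and the data frame D is stand-in data, since the real D is not available here. Only the column names Nload and Pload come from the question. That Pload is the column containing NAs is an assumption inferred from the na.rm = TRUE in the original call.

```r
# manual_cor is a hypothetical helper, not part of base R.
manual_cor <- function(a, b) {
  ok <- complete.cases(a, b)  # keep only pairs where both values are present
  a <- a[ok]
  b <- b[ok]
  cov(a, b) / (sd(a) * sd(b)) # all three terms now use the same observations
}

# Stand-in data (assumption): the real D from the question is unknown.
set.seed(42)
D <- data.frame(Nload = rnorm(50), Pload = rnorm(50))
D$Pload[c(3, 17)] <- NA

r_manual  <- manual_cor(D$Nload, D$Pload)
r_builtin <- cor(D$Nload, D$Pload, use = "pairwise.complete.obs")
all.equal(r_manual, r_builtin)
```

Filtering once inside the helper means the covariance and both standard deviations are always computed on the same rows, whichever column holds the missing values.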


 