R: calculate the correlation coefficient
I have a data frame with three variables: "age", "confidence" and "countryname". I want to compare the correlation between age and confidence across countries, so I wrote the following command to calculate the correlation coefficient:
correlate <- evs %>% group_by(countryname) %>% summarise(c = cor(age, confidence))
But I found that there are a lot of missing values in the output "c". Does that mean there is little correlation between the IV and DV for these countries, or is there something wrong with my command?
An NA in the correlation matrix means that you have NA values (i.e. missing values) in your observations. The default behaviour of cor is to return a correlation of NA "whenever one of its contributing observations is NA" (from the manual). That means that a single NA in the data will give a correlation of NA, even when you have only one NA among a thousand useful data points.
What you can do from here:
Add the argument use when calling cor. This way you specify how the algorithm shall handle missing values. Check the manual (?cor) to find out what options you have. In your case I would just use use="complete.obs". With only 2 variables, most (but not all) options will yield the same result.
Some more explanation:
age <- 18:35
confidence <- (age - 17) / 10 + rnorm(length(age))
cor(age, confidence)
#> [1] 0.3589942
Above is the correlation with all the data. Now let's set a few NAs and try again:
confidence[c(1, 6, 11, 16)] <- NA
cor(age, confidence) # the use argument is implicitly "everything"
#> [1] NA
This gives NA because some confidence values are NA. The next statement still gives a result:
cor(age, confidence, use="complete.obs")
#> [1] 0.3130549
Created on 2021-10-16 by the reprex package (v2.0.1)
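Applied to the grouped pipeline from the question, the same idea could look like the sketch below. The `evs` data frame here is a small made-up stand-in (the real data are not shown in the question), and the `n_complete` column is an illustrative addition that reports how many complete age/confidence pairs each country contributes.

```r
library(dplyr)

# Hypothetical stand-in for the `evs` data frame from the question
evs <- data.frame(
  countryname = rep(c("A", "B"), each = 5),
  age         = c(20, 30, 40, 50, 60, 25, 35, 45, 55, 65),
  confidence  = c(1, 2, NA, 4, 5, 5, 4, 3, NA, 1)
)

# Default (use = "everything"): one NA per country is enough to give NA
correlate_default <- evs %>%
  group_by(countryname) %>%
  summarise(c = cor(age, confidence))

# use = "complete.obs": rows with a missing value are dropped before computing
correlate <- evs %>%
  group_by(countryname) %>%
  summarise(
    n_complete = sum(complete.cases(age, confidence)),
    c          = cor(age, confidence, use = "complete.obs")
  )

print(correlate_default)
print(correlate)
```

According to ?cor, "complete.obs" raises an error when a group has no complete pairs at all, while "na.or.complete" returns NA in that case, so the latter may be the safer choice if some countries have no usable rows.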
I know two ways of calculating it in R.
Calculation with the built-in cor() function:
# importing df:
state_crime <- read.csv("~/Documents/R/state_crime.csv")
# checking colnames:
colnames(state_crime)
[1] "state" "year" "population"
[4] "murder_rate"
# correlation coefficient between population and murder rate:
cor(state_crime$population, state_crime$murder_rate, method = "pearson")
[1] -0.0322388
Manual calculation with code:
library(dplyr)
# creating columns for "deviation from the mean" for both variables:
state_crime <- state_crime %>%
  mutate(dev_mean_murderrate = murder_rate - mean(murder_rate)) %>%
  mutate(dev_mean_population = population - mean(population)) %>%
  data.frame()
# implementing the formula:
# r = sum((x - mx) * (y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
sum(state_crime$dev_mean_population * state_crime$dev_mean_murderrate) /
  sqrt(sum((state_crime$murder_rate - mean(state_crime$murder_rate))^2) *
       sum((state_crime$population - mean(state_crime$population))^2))
[1] -0.0322388
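Since state_crime.csv is not available here, the agreement between the manual formula and cor() can be checked on a small made-up data set. This is only a sketch: the vectors `x` and `y` and the names `r_manual` and `r_builtin` are invented for illustration.

```r
# Made-up data, chosen only to illustrate the formula
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Deviations from the mean for both variables
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson's r: sum of cross-products over the root of the product of sums of squares
r_manual  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_builtin <- cor(x, y, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree up to floating-point error, which all.equal(r_manual, r_builtin) can confirm.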