
R: calculate the correlation coefficient

I have a data frame with three variables: "age", "confidence" and "countryname". I want to compare the correlation between age and confidence across countries, so I wrote the following command to calculate the correlation coefficient.

correlate <- evs %>% group_by(countryname) %>% summarise(c = cor(age, confidence))

But I found that there are a lot of missing values in the output "c". Does that mean there is little correlation between the IV and the DV for these countries, or is there something wrong with my command?

An NA in the correlation result means that you have NA values (i.e. missing values) in your observations. The default behaviour of cor is to return a correlation of NA "whenever one of its contributing observations is NA" (from the manual).

That means a single NA in the data gives a correlation of NA, even when it is the only NA among a thousand otherwise usable observations.

What you can do from here:

  1. Investigate these NAs: count them and determine whether your data set contains enough usable data. Find out which variables are affected by NAs and to what extent.
  2. Add the argument use when calling cor. This specifies how missing values are handled. Check the manual (with ?cor) to see which options you have. In your case I would just use use="complete.obs". With only 2 variables, most (but not all) options will yield the same result.
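
Applied to the grouped command from the question, the two steps above might look like the sketch below. The small evs data frame is a hypothetical stand-in with invented values (the real data is not shown in the question), and the column names n_total, n_complete, c_default and c_complete are chosen only for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the asker's `evs` data frame (invented values).
evs <- data.frame(
  countryname = rep(c("A", "B"), each = 5),
  age         = c(20, 30, 40, 50, 60, 25, 35, 45, 55, 65),
  confidence  = c(1, 2, NA, 3, 4, 4, 3, 3, 2, 1)
)

correlate <- evs %>%
  group_by(countryname) %>%
  summarise(
    n_total    = n(),                                    # rows per country
    n_complete = sum(complete.cases(age, confidence)),   # rows with no NA in either variable
    c_default  = cor(age, confidence),                   # NA as soon as one value is missing
    c_complete = cor(age, confidence, use = "complete.obs")  # drops incomplete rows first
  )

print(correlate)
```

Note that use = "complete.obs" raises an error for a group that has no complete cases at all; according to ?cor, use = "na.or.complete" returns NA in that situation instead, which may be more convenient inside summarise.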

Some more explanation:

age <- 18:35
confidence <- (age - 17) / 10 + rnorm(length(age))
cor(age, confidence)
#> [1] 0.3589942

Above is the correlation with all the data. Now let's set a few NAs and try again:

confidence[c(1, 6, 11, 16)] <- NA
cor(age, confidence) # the use argument implicitly defaults to "everything"
#> [1] NA

This gives NA because some confidence values are NA. The next statement still gives a result:

cor(age, confidence, use="complete.obs")
#> [1] 0.3130549

Created on 2021-10-16 by the reprex package (v2.0.1)
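
To make concrete what use="complete.obs" does in the example above, the following sketch drops the incomplete pairs by hand and compares the result with cor. The seed is an arbitrary assumption added for reproducibility (the original example did not set one), so the numbers will differ from those shown above.

```r
set.seed(1)  # arbitrary seed, not part of the original example

age <- 18:35
confidence <- (age - 17) / 10 + rnorm(length(age))
confidence[c(1, 6, 11, 16)] <- NA

# Keep only the observations where both variables are present.
keep <- complete.cases(age, confidence)

r_manual   <- cor(age[keep], confidence[keep])
r_complete <- cor(age, confidence, use = "complete.obs")
r_pairwise <- cor(age, confidence, use = "pairwise.complete.obs")

# With only two variables, both options reduce to the same set of pairs.
all.equal(r_manual, r_complete)
all.equal(r_complete, r_pairwise)
```

With more than two variables (a correlation matrix), "complete.obs" and "pairwise.complete.obs" can differ, because the pairwise option drops rows separately for each pair of columns.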

I know two ways to calculate it in R:

  • via the built-in cor() function,
  • manual calculation with code

Calculation with the built-in cor() function:

 # importing df:
 state_crime <- read.csv("~/Documents/R/state_crime.csv")

 # checking colnames:
 colnames(state_crime)
 #> [1] "state"       "year"        "population"
 #> [4] "murder_rate"

 # correlation coefficient between population and murder rate:
 cor(state_crime$population, state_crime$murder_rate, method = "pearson")
 #> [1] -0.0322388

Manual calculation with code:

 library(dplyr)  # provides %>% and mutate()

 # creating columns for "deviation from the mean" for both variables:
 state_crime <- state_crime %>%
   mutate(dev_mean_murderrate = murder_rate - mean(murder_rate)) %>%
   mutate(dev_mean_population = population - mean(population)) %>%
   data.frame()

 # implementing the formula:
 # r = sum((x - mx) * (y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
 sum(state_crime$dev_mean_population * state_crime$dev_mean_murderrate) /
   sqrt(sum((state_crime$murder_rate - mean(state_crime$murder_rate))^2) *
        sum((state_crime$population - mean(state_crime$population))^2))
 #> [1] -0.0322388
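
Since state_crime.csv is not available to readers, here is a self-contained sketch of the same manual calculation on R's built-in mtcars data set. The choice of the wt and mpg columns is only an assumption for illustration; any two numeric vectors without missing values would do.

```r
# Two numeric vectors from the built-in mtcars data set (chosen for illustration).
x <- mtcars$wt
y <- mtcars$mpg

# Deviations from the mean for both variables.
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson's r: sum of cross-products over the square root of the
# product of the sums of squared deviations.
r_manual  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_builtin <- cor(x, y, method = "pearson")

all.equal(r_manual, r_builtin)
```

The two values should agree up to floating-point tolerance, which is why all.equal is used here rather than ==.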
