
How can I find the correlation coefficient between two columns by factors?

I have a large data frame. I want to calculate the correlation coefficient between hot and index, separately for each class.

ID     hot    index  class
41400  10     2      a
41400  12     2      a
41400  75     4      a
41401  89     5      a
41401  25     3      c
41401  100    6      c
20445  67     4      c
20445  89     6      c
20445  4      1      c
20443  67     5      d
20443  120.2  7      a
20443  140.5  8      d
20423  170.5  10     d
20423  78.1   5      c

Intended output

a = 0.X (assumed numbers)
b = 0.Y
c = 0.Z

I know I can use the by function for this, but I could not get it to work.

Code

cor_eqn = function(df){
  m = cor(hot ~ index, df);

}

by(df,df$class,cor_eqn,simplify = TRUE)
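
The attempt above fails because cor() has no formula interface (unlike cor.test()); it expects two numeric vectors or a numeric matrix. Below is a minimal base R sketch of what the by() call was presumably meant to do, using the sample rows from the question. The data frame name dat is chosen for this sketch and is not part of the original post.

```r
# Sample rows copied from the question; `dat` is a name chosen for this sketch.
dat <- data.frame(
  ID    = c(41400, 41400, 41400, 41401, 41401, 41401, 20445,
            20445, 20445, 20443, 20443, 20443, 20423, 20423),
  hot   = c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1),
  index = c(2, 2, 4, 5, 3, 6, 4, 6, 1, 5, 7, 8, 10, 5),
  class = c("a", "a", "a", "a", "c", "c", "c", "c", "c", "d", "a", "d", "d", "c")
)

# cor() takes two numeric vectors, not a formula.
cor_eqn <- function(d) cor(d$hot, d$index)

# One correlation per level of `class`, returned as a "by" object.
by_class <- by(dat, dat$class, cor_eqn)

# The same result as a plain named numeric vector.
cor_vec <- sapply(split(dat, dat$class), cor_eqn)
print(cor_vec)
```

split() plus sapply() gives a named vector that is often easier to work with afterwards than the "by" object, but both compute the same per-class correlations.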

Another option is to use a data.table instead of a data.frame. You can simply call setDT(df) on your existing data.frame (in the example below I create a data.table directly):

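library(data.table)

set.seed(123)
# Simulated example data: 50,000 rows in four classes
DT <- data.table(
  ID = 1:50000,
  class = rep(letters[1:4], each = 12500),
  hot = rnorm(50000),
  index = rgamma(50000, shape = 2))

# Set a key for better performance with a large data set.
# setkeyv() expects the column name as a character string.
setkeyv(DT, "class")

# Correlation between hot and index within each class
DT[, list(Correlation = cor(hot, index)), by = class]
#    class  Correlation
# 1:     a  0.005658200
# 2:     b  0.001651747
# 3:     c -0.002147164
# 4:     d -0.006248392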
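
To illustrate the setDT() route mentioned above, here is a small sketch that starts from an ordinary data.frame. The object name dat and the handful of rows (taken from the top of the question's sample) are assumptions made for illustration only.

```r
library(data.table)

# A few rows in the shape of the question's data; `dat` is a name chosen here.
dat <- data.frame(
  hot   = c(10, 12, 75, 89, 25, 100, 67, 89, 4),
  index = c(2, 2, 4, 5, 3, 6, 4, 6, 1),
  class = c("a", "a", "a", "a", "c", "c", "c", "c", "c")
)

# setDT() converts the data.frame to a data.table by reference (no copy).
setDT(dat)

# Same grouped call as above, now on the converted object.
res <- dat[, .(Correlation = cor(hot, index)), by = class]
print(res)
```

Because setDT() modifies the object in place, there is no need to assign its result; dat itself is a data.table afterwards.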

You can use dplyr for this:

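library(dplyr)
# `dataset` is the data frame holding the question's data
gp = group_by(dataset, class)
# one row per class, with the correlation of hot and index
correl = dplyr::summarise(gp, correl = cor(hot, index))
print(correl)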

#  class   correl
#   a      0.9815492
#   c      0.9753372
#   d      0.9924337

Note that class and df are also the names of R functions; using names like these for your own objects or columns can cause trouble.
