How can I find the correlation coefficient between two columns by factors?
I have a large data frame. I want to calculate the correlation coefficient between hot and index, by class.
ID hot index class
41400 10 2 a
41400 12 2 a
41400 75 4 a
41401 89 5 a
41401 25 3 c
41401 100 6 c
20445 67 4 c
20445 89 6 c
20445 4 1 c
20443 67 5 d
20443 120.2 7 a
20443 140.5 8 d
20423 170.5 10 d
20423 78.1 5 c
Intended output
a = 0.X (assumed numbers)
b = 0.Y
c = 0.Z
I know I can use the by function, but I have not been able to get it to work.
Code
cor_eqn = function(df){
m = cor(hot ~ index, df);
}
by(df,df$class,cor_eqn,simplify = TRUE)
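The by() attempt above fails because cor() expects numeric vectors (or matrices), not a formula. A minimal base R sketch of a working version follows, using the sample data from the question. The names dat, cor_by_class and vals are illustrative choices, not part of the original post.

```r
# Sample data copied from the question. The name `dat` is an assumption,
# chosen so the data frame does not share a name with the function df().
dat <- data.frame(
  ID    = c(41400, 41400, 41400, 41401, 41401, 41401, 20445,
            20445, 20445, 20443, 20443, 20443, 20423, 20423),
  hot   = c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1),
  index = c(2, 2, 4, 5, 3, 6, 4, 6, 1, 5, 7, 8, 10, 5),
  class = c("a", "a", "a", "a", "c", "c", "c", "c", "c", "d", "a", "d", "d", "c")
)

# cor() is given the two columns directly instead of a formula.
cor_by_class <- function(d) cor(d$hot, d$index)

# by() applies the function to the rows belonging to each class.
res <- by(dat, dat$class, cor_by_class)
print(res)

# The same values as a plain named numeric vector.
vals <- sapply(split(dat, dat$class), cor_by_class)
print(vals)
```

by() returns an object of class "by", which prints one block per group. The split()/sapply() form is often easier to reuse when a plain vector is wanted.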
Another option is to use a data.table instead of a data.frame. You can just call setDT(df) on your existing data.frame (I created a data.table initially below):
library(data.table)
##
set.seed(123)
DT <- data.table(
  ID    = 1:50000,
  class = rep(letters[1:4], each = 12500),
  hot   = rnorm(50000),
  index = rgamma(50000, shape = 2))
## set key for better performance with a large data set
## (setkey takes an unquoted column name; setkeyv would need "class")
setkey(DT, class)
##
DT[, list(Correlation = cor(hot, index)), by = class]
#    class  Correlation
# 1:     a  0.005658200
# 2:     b  0.001651747
# 3:     c -0.002147164
# 4:     d -0.006248392
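The setDT() route mentioned above can be sketched on the question's own sample data. This assumes the data.table package is installed. The names dat and res are illustrative.

```r
library(data.table)

# Sample data copied from the question, built as an ordinary data.frame.
dat <- data.frame(
  ID    = c(41400, 41400, 41400, 41401, 41401, 41401, 20445,
            20445, 20445, 20443, 20443, 20443, 20423, 20423),
  hot   = c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1),
  index = c(2, 2, 4, 5, 3, 6, 4, 6, 1, 5, 7, 8, 10, 5),
  class = c("a", "a", "a", "a", "c", "c", "c", "c", "c", "d", "a", "d", "d", "c")
)

# setDT() converts the existing data.frame to a data.table by reference,
# so no copy of the data is made.
setDT(dat)

# One correlation per class; .() is data.table shorthand for list().
res <- dat[, .(Correlation = cor(hot, index)), by = class]
print(res)
```

Because the conversion happens in place, dat itself is a data.table afterwards and no reassignment is needed.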
You can use dplyr for this:
library(dplyr)
gp = group_by(dataset, class)
correl = dplyr::summarise(gp, correl = cor(hot, index))
print(correl)
# class correl
# a 0.9815492
# c 0.9753372
# d 0.9924337
Note that class and df are R functions, so names like these can cause trouble.
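A short sketch of why those names can cause trouble: in a fresh session both already refer to functions, so a missing or misspelled data frame does not produce an "object not found" error. The variable err is illustrative.

```r
# Both names resolve to existing functions.
is.function(stats::df)    # density function of the F distribution
is.function(base::class)  # returns the class attribute of an object

# If no data frame named df has been created, df$class tries to subset
# the function itself. The error is captured here instead of stopping.
err <- tryCatch(stats::df$class, error = function(e) conditionMessage(e))
print(err)
```

Choosing a name such as dat for the data frame avoids this kind of confusing error message.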