How can I calculate the inter-pair correlation of a variable according to id in the whole dataframe?

I have a twin dataset with one column called wpsum and another called family-id, which is the same for both twins of a pair.

        wpsum   family-id
twin 1     14         220
twin 2     18         220

I want to calculate the correlation between the wpsum values of those with the same family-id. Some family-ids occur only once, because one twin did not take part in the re-survey. family-id is a character variable.

There's no correlation between the wpsum values of those with the same family-id, as you put it, mainly because there's no other variable with which to correlate wpsum within the family-id groups (see my comment). You can, however, get the difference in wpsum scores within each group; maybe that's what you meant by correlation. Here's how to get those (I changed and expanded your example):

# Example data: family 222 has a missing wpsum value, family 223 has only one twin
dat <- data.frame(wpsum = c(14, 18, 20, 5, 10, NA, 1),
                  family_id = c("220","220","221","221","222","222","223"))
dat
  wpsum family_id
1    14       220
2    18       220
3    20       221
4     5       221
5    10       222
6    NA       222
7     1       223

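# Absolute difference between the first and second wpsum within each family;
# x$wpsum[2] is NA when a family has only one row, so the result is NA rather than an error
diffs <- by(dat, dat$family_id, function(x) abs(x$wpsum[1] - x$wpsum[2]))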
diffs
dat$family_id: 220
[1] 4
------------------------------ 
dat$family_id: 221
[1] 15
------------------------------
dat$family_id: 222
[1] NA
------------------------------
dat$family_id: 223
[1] NA

You can make a data.frame with this new variable of differences like so:

diff.frame <- data.frame(diffs = as.numeric(diffs), family_id = names(diffs))
diff.frame
  diffs family_id
1     4       220
2    15       221
3    NA       222
4    NA       223
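
If what the question meant is the correlation across pairs (the first twin's wpsum against the second twin's), the same one-row-per-family layout can be used for that. The following is only a sketch under that assumption; the names pairs, twin1 and twin2 are made up for illustration, and which twin counts as "first" is just the row order in dat.

```r
# Same example data as above
dat <- data.frame(wpsum = c(14, 18, 20, 5, 10, NA, 1),
                  family_id = c("220","220","221","221","222","222","223"))

# First and second wpsum value within each family_id;
# x[2] is NA when a family has only one row
first  <- tapply(dat$wpsum, dat$family_id, function(x) x[1])
second <- tapply(dat$wpsum, dat$family_id, function(x) x[2])

# One row per family (hypothetical column names twin1 / twin2)
pairs <- data.frame(family_id = names(first),
                    twin1 = as.numeric(first),
                    twin2 = as.numeric(second))

# Pearson correlation using only families where both values are present
pair.cor <- cor(pairs$twin1, pairs$twin2, use = "complete.obs")
pair.cor
```

The toy data has only two complete pairs (220 and 221), so the value it returns says nothing useful; with real data there would be many pairs. Because the order of twins within a pair is arbitrary, an intraclass correlation may be a better fit than a Pearson correlation for this kind of data.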

Note that neither missing values nor missing observations are a (coding) problem here: they just result in missing differences, without an error. If you started having more than two observations within each family ID, though, then you'd need to do something different.
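
For the case of more than two observations per family, one possible "something different" is the spread (maximum minus minimum) of wpsum within each family. This is a sketch of one option, not the only sensible one; dat3, spread and spread.frame are hypothetical names, and the data is invented for illustration.

```r
# Invented example: family 220 has three observations, family 223 only one
dat3 <- data.frame(wpsum = c(14, 18, 11, 20, 5, 1),
                   family_id = c("220","220","220","221","221","223"))

# Spread of wpsum within each family: max minus min of the non-missing values.
# Families with fewer than two non-missing values get NA.
spread <- tapply(dat3$wpsum, dat3$family_id, function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 2) NA_real_ else max(x) - min(x)
})

spread.frame <- data.frame(spread = as.numeric(spread), family_id = names(spread))
spread.frame
```

For families with exactly two non-missing values this gives the same number as the absolute difference computed above.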
