Extracting unique combination rows from a data frame in R

I have a data frame of pairwise correlations between the scores given by people in the same state. I am giving a small example of what I want to do with this data; my actual data set has 15 million rows of pairwise correlations and many additional columns.

Below is the example data:

>sample_data

Pair_1ID    Pair_2ID    CORR    
1           2           0.12    
1           3           0.23    
2           1           0.12    
2           3           0.75    
3           1           0.23    
3           2           0.75    

I want to generate a new data frame without duplicates. For example, row 1 gives the correlation between persons 1 and 2 as 0.12. Row 3 carries the same information, since it gives the correlation between persons 2 and 1. I would like a final data frame that keeps only one row per pair, like the one below:

>output

Pair_1ID    Pair_2ID    CORR
1           2           0.12
1           3           0.23
2           3           0.75

Can someone help? The unique command won't work for this, and I don't know how to do it.

Assuming every combination shows up twice:

subset(sample_data, Pair_1ID <= Pair_2ID)
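
To try this on the question's example, here is a minimal sketch that rebuilds sample_data by hand and also checks the assumption that every pair appears in both directions. The names result, forward, backward and all_mirrored are illustrative only and not part of the original post.

```r
# Rebuild the example data from the question
sample_data <- data.frame(
  Pair_1ID = c(1, 1, 2, 2, 3, 3),
  Pair_2ID = c(2, 3, 1, 3, 1, 2),
  CORR     = c(0.12, 0.23, 0.12, 0.75, 0.23, 0.75)
)

# Keep only the rows where the smaller id comes first
result <- subset(sample_data, Pair_1ID <= Pair_2ID)
print(result)

# Check the assumption: every (a, b) row has a matching (b, a) row
forward  <- paste(sample_data$Pair_1ID, sample_data$Pair_2ID)
backward <- paste(sample_data$Pair_2ID, sample_data$Pair_1ID)
all_mirrored <- all(forward %in% backward)
print(all_mirrored)
```

If all_mirrored comes out FALSE, the plain subset would silently drop pairs that exist only in the "larger id first" direction.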

If not (some combinations show up only once):

unique(transform(sample_data, Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                              Pair_2ID = pmax(Pair_1ID, Pair_2ID)))
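
A small sketch of why the relabeling matters when a pair appears only once. The data frame one_sided is made up for illustration. Note that transform() evaluates both expressions against the original columns, so pmax() still sees the untouched Pair_1ID; tools that apply assignments one after another (dplyr::mutate, for instance) would need a temporary column instead.

```r
# Hypothetical data where each pair appears only once, in arbitrary direction
one_sided <- data.frame(
  Pair_1ID = c(2, 1, 3),
  Pair_2ID = c(1, 3, 2),
  CORR     = c(0.12, 0.23, 0.75)
)

# The plain subset loses the pairs stored as (larger id, smaller id)
kept_by_subset <- subset(one_sided, Pair_1ID <= Pair_2ID)
print(nrow(kept_by_subset))

# Relabel so that the smaller id is always in Pair_1ID
relabeled <- transform(one_sided,
                       Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                       Pair_2ID = pmax(Pair_1ID, Pair_2ID))
print(relabeled)
```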

Edit: regarding that last one, including CORR in the unique() call is not a great idea because of possible floating-point issues. I also see you mention that you have many more columns. So it would be better to limit the comparison to the two ids:

relabeled <- transform(sample_data, Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                                    Pair_2ID = pmax(Pair_1ID, Pair_2ID))
subset(relabeled, !duplicated(cbind(Pair_1ID, Pair_2ID)))
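
To illustrate the floating-point concern, here is a sketch with made-up data (noisy is a hypothetical data frame) in which the two directions of one pair carry correlations that print the same but are not exactly equal. Comparing on the two ids alone still collapses them into one row, whereas a comparison that includes CORR may treat them as distinct.

```r
# Two rows for the same pair whose CORR values differ only by rounding error
noisy <- data.frame(
  Pair_1ID = c(1, 2),
  Pair_2ID = c(2, 1),
  CORR     = c(0.3, 0.1 + 0.2)
)

# Exact comparison of the two doubles fails
same_value <- noisy$CORR[1] == noisy$CORR[2]
print(same_value)

# Deduplicate on the ids only, ignoring CORR and any other columns
relabeled <- transform(noisy,
                       Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                       Pair_2ID = pmax(Pair_1ID, Pair_2ID))
dedup <- subset(relabeled, !duplicated(cbind(Pair_1ID, Pair_2ID)))
print(dedup)
```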

flodel's answer is really excellent. I just want to add another solution based on indexing, without looking at the actual values. It only works if all combinations are present and the data frame is ordered by column 1 first and then by column 2 (as in the example).

maxVal <- max(sample_data$Pair_1ID)
shrtIdx <- logical(maxVal)
idx <- sapply(seq(maxVal - 1, 1), function(x) replace(shrtIdx, seq(x), TRUE))
sample_data[idx,]

#   Pair_1ID Pair_2ID CORR
# 1        1        2 0.12
# 2        1        3 0.23
# 4        2        3 0.75
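
Since the index-based approach depends on those preconditions, a quick check before using it may be worthwhile. The helper below, is_complete_and_sorted, is a hypothetical name and only a sketch; it additionally assumes the ids are the consecutive integers 1 to n and that no self-pairs are present.

```r
# Sketch of a precondition check (assumes ids are 1..n, no self-pairs)
is_complete_and_sorted <- function(df) {
  n <- max(df$Pair_1ID)
  # expand.grid varies its first argument fastest, so this is ordered
  # by Pair_1ID first and Pair_2ID second
  expected <- expand.grid(Pair_2ID = seq_len(n), Pair_1ID = seq_len(n))
  expected <- expected[expected$Pair_1ID != expected$Pair_2ID, ]
  nrow(df) == nrow(expected) &&
    all(df$Pair_1ID == expected$Pair_1ID) &&
    all(df$Pair_2ID == expected$Pair_2ID)
}

sample_data <- data.frame(
  Pair_1ID = c(1, 1, 2, 2, 3, 3),
  Pair_2ID = c(2, 3, 1, 3, 1, 2),
  CORR     = c(0.12, 0.23, 0.12, 0.75, 0.23, 0.75)
)

ok_full     <- is_complete_and_sorted(sample_data)          # complete and ordered
ok_missing  <- is_complete_and_sorted(sample_data[-3, ])    # one row removed
ok_reversed <- is_complete_and_sorted(sample_data[6:1, ])   # wrong order
print(c(ok_full, ok_missing, ok_reversed))
```

If the check fails, the value-based solutions using pmin and pmax are the safer choice.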
