
Extracting unique combination rows from a data frame in R

I have a data frame of pairwise correlations between the scores provided by people in the same state. Below is a small example of what I want to do; my actual data set has 15 million rows of pairwise correlations and many additional columns.

Below is the example data:

>sample_data

Pair_1ID    Pair_2ID    CORR    
1           2           0.12    
1           3           0.23    
2           1           0.12    
2           3           0.75    
3           1           0.23    
3           2           0.75    

I want to generate a new data frame without duplicates. For example, row 1 gives the correlation between persons 1 and 2 as 0.12, and row 3 gives the correlation between persons 2 and 1, which is the same information. I would like a final data frame that keeps each pair only once, like the one below:

>output

Pair_1ID    Pair_2ID    CORR
1           2           0.12
1           3           0.23
2           3           0.75

Can someone help? The unique command won't work here, because the two rows of a pair have their IDs in opposite order, and I don't know how else to do it.

Assuming every combination shows up twice:

subset(sample_data , Pair_1ID <= Pair_2ID)
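
To try this on the example from the question, here is a minimal sketch that rebuilds sample_data by hand and checks the "shows up twice" assumption before filtering. The table() check and the pair_key label are my own additions for illustration, not part of the original answer.

```r
# Rebuild the example data from the question.
sample_data <- data.frame(
  Pair_1ID = c(1, 1, 2, 2, 3, 3),
  Pair_2ID = c(2, 3, 1, 3, 1, 2),
  CORR     = c(0.12, 0.23, 0.12, 0.75, 0.23, 0.75)
)

# Check the assumption: every unordered pair appears exactly twice.
# pair_key is a hypothetical label, e.g. "1-2" for both (1,2) and (2,1).
pair_key <- paste(pmin(sample_data$Pair_1ID, sample_data$Pair_2ID),
                  pmax(sample_data$Pair_1ID, sample_data$Pair_2ID),
                  sep = "-")
all_twice <- all(table(pair_key) == 2)

# Keep only the rows where the smaller ID comes first.
result <- subset(sample_data, Pair_1ID <= Pair_2ID)
print(result)
```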

If not:

unique(transform(sample_data, Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                              Pair_2ID = pmax(Pair_1ID, Pair_2ID)))

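Edit: regarding that last one, including CORR in the unique comparison is not a great idea because of possible floating point issues: two values that should be equal may differ in their last digits, so the rows would not be recognized as duplicates. I also see you mention that you have a lot more columns. So it would be better to limit the comparison to the two IDs: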

relabeled <- transform(sample_data, Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                                    Pair_2ID = pmax(Pair_1ID, Pair_2ID))
subset(relabeled, !duplicated(cbind(Pair_1ID, Pair_2ID)))
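
To illustrate the difference between the two variants, here is a small sketch on made-up data (the noisy data frame and its values are hypothetical, not from the question). The pair (2,3) appears only once, and the two CORR values for the pair (1,2) differ by a tiny amount of numerical noise.

```r
# Hypothetical data: (3,2) has no mirrored row, and the two rows for the
# pair (1,2) carry CORR values that differ by a small numerical error.
noisy <- data.frame(
  Pair_1ID = c(1, 2, 3),
  Pair_2ID = c(2, 1, 2),
  CORR     = c(0.12, 0.12 + 1e-9, 0.75)
)

# Put the smaller ID first in every row. transform() evaluates both
# expressions against the original columns, so the order is safe.
relabeled <- transform(noisy, Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                              Pair_2ID = pmax(Pair_1ID, Pair_2ID))

# Comparing all columns: the noisy CORR values keep both (1,2) rows.
n_all_columns <- nrow(unique(relabeled))

# Comparing only the two ID columns: the second (1,2) row is dropped.
deduped <- subset(relabeled, !duplicated(cbind(Pair_1ID, Pair_2ID)))
print(deduped)
```

Note that duplicated() keeps the first occurrence of each pair, so which CORR value survives depends on the row order.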

flodel's answer is excellent. I just want to add another solution that is based purely on row positions and never looks at the actual values. It only works if all combinations are present and the data frame is sorted by column 1 first and then by column 2 (as in the example).

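# number of persons (the largest ID)
maxVal <- max(sample_data$Pair_1ID)
# template of maxVal FALSE values
shrtIdx <- logical(maxVal)
# one column per block of rows: first x entries TRUE, for x = maxVal - 1, ..., 1
idx <- sapply(seq(maxVal - 1, 1), function(x) replace(shrtIdx, seq(x), TRUE))
# the logical matrix is read column by column as a row selector
sample_data[idx,]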

#   Pair_1ID Pair_2ID CORR
# 1        1        2 0.12
# 2        1        3 0.23
# 4        2        3 0.75
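
As a sanity check, the positional approach can be compared with the value-based subset() on a slightly larger complete data set. This is only a sketch under stated assumptions: the 4-person data frame full and its CORR values are made up, and the IDs are taken to be the consecutive integers 1..n, which the code seems to rely on because max() is used as the number of persons.

```r
n <- 4  # hypothetical number of persons, with IDs 1..n

# All ordered pairs, sorted by Pair_1ID and then Pair_2ID
# (expand.grid varies its first argument fastest).
grid <- expand.grid(Pair_2ID = seq_len(n), Pair_1ID = seq_len(n))
full <- grid[grid$Pair_1ID != grid$Pair_2ID, c("Pair_1ID", "Pair_2ID")]
rownames(full) <- NULL
# Placeholder symmetric values standing in for the correlations.
full$CORR <- round(1 / (full$Pair_1ID + full$Pair_2ID), 2)

# Positional approach from the answer above.
maxVal  <- max(full$Pair_1ID)
shrtIdx <- logical(maxVal)
idx <- sapply(seq(maxVal - 1, 1), function(x) replace(shrtIdx, seq(x), TRUE))
by_index <- full[idx, ]

# Value-based approach for comparison.
by_value <- subset(full, Pair_1ID <= Pair_2ID)

# Both should select the same rows.
same_rows <- identical(rownames(by_index), rownames(by_value))
print(same_rows)
```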

