
Finding unique pairs in R (with no repetition of ANY values)

I have a data frame containing blood results for two cohorts of patients (groups x and y). Each cohort has 2000 patients, each with a distinct id number. The cohorts have been (fuzzy) joined on the time of the test, as I am interested in timings. Because many test times matched, ids are duplicated in both groups wherever the timings are similar.

Here is an example:

id.x  time.x  value.x  id.y  time.y  value.y
1     23      4.1      11    18      4.3
1     23      4.1      12    25      4.8
2     54      3.9      13    51      4.3
2     54      3.9      14    52      4.0
3     72      4.5      14    70      4.3
3     72      4.5      15    25      4.3

There is a 1:1 ratio of id numbers in x and y groups.
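
For anyone who wants to reproduce this, the example table can be built as a data frame roughly as follows. This is only a sketch: the .x/.y suffixes on the time and value columns are an assumption (the usual result of a join), and test.df is the name used in the attempts below.

```r
# Reconstruction of the example table; the time/value column suffixes are assumed
test.df <- data.frame(
  id.x    = c(1, 1, 2, 2, 3, 3),
  time.x  = c(23, 23, 54, 54, 72, 72),
  value.x = c(4.1, 4.1, 3.9, 3.9, 4.5, 4.5),
  id.y    = c(11, 12, 13, 14, 14, 15),
  time.y  = c(18, 25, 51, 52, 70, 25),
  value.y = c(4.3, 4.8, 4.3, 4.0, 4.3, 4.3)
)
```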

Attempts so far

I am trying to find unique pairs, where no id is repeated in either column. I have tried

test.df %>% distinct(id.x, .keep_all = TRUE)

This half works: I get unique values of id.x, but they are matched with id.y values that are repeated, because I haven't specified that those must be unique as well.

I have also tried

testsample.df <- unique(test.df[, c('id.x', 'id.y')])

This gives unique pairs, i.e. no pair is repeated, but each id still appears many times, in every combination it was matched in.

I'm not sure if this is even possible.

Partial success

The closest I've come is a two-stage process: subsampling one random row per unique id.y, then subsampling that result to one random row per unique id.x, using the following:

library(plyr)

# Stage 1: keep one random row for each id.y
subsampled_data <- ddply(test.df, .(id.y), function(x) x[sample(nrow(x), size = 1), ])

# Stage 2: from those rows, keep one random row for each id.x
subsampled_data2.df <- ddply(subsampled_data, .(id.x), function(x) x[sample(nrow(x), size = 1), ])

Doing this, I do end up with unique pairs containing only unique ids. However, I lose quite a few rows, going from 2000 to ~1000.

Is it possible to find the unique pairs with unique ids without losing so many at each step?

Thanks!

I'm not sure what your expected output is, but I hope this helps.

Sample data:

df <- data.frame(id.x = c(1,1,2,2,3,3), id.y = c(11,12,13,13,13,13))

Keep the first row for each id.x:

df <- df[!duplicated(df$id.x), ]

  id.x id.y
1    1   11
3    2   13
5    3   13

We still have duplicates in id.y, so we remove them the same way:

df <- df[!duplicated(df$id.y), ]

This leads to:

  id.x id.y
1    1   11
3    2   13

Or with dplyr:

df %>% distinct(id.x, .keep_all = TRUE) %>% distinct(id.y, .keep_all = TRUE)

returns:

  id.x id.y
1    1   11
2    2   13
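
Note that de-duplicating one column and then the other is greedy: an id.y taken by an early id.x is no longer available to a later id.x that had no other candidate, so rows can be lost unnecessarily. If the goal is to keep as many pairs as possible, the problem can be treated as a maximum bipartite matching. Below is a minimal base R sketch using augmenting paths (Kuhn's algorithm). The function max_unique_pairs and its arguments are hypothetical names made up for this illustration, and the code assumes the ids contain no spaces and are not missing.

```r
# Hypothetical helper (not part of any package): find a largest set of rows
# in which no id.x and no id.y is repeated, using augmenting paths.
max_unique_pairs <- function(df, x = "id.x", y = "id.y") {
  # For each distinct x id, the y ids it was joined to
  candidates <- lapply(split(df[[y]], df[[x]]), unique)
  # Environment used as a mutable map: y id -> x id currently assigned to it
  matched <- new.env()

  # Try to give xi a partner, re-routing earlier assignments if necessary
  try_assign <- function(xi, visited) {
    for (yi in as.character(candidates[[xi]])) {
      if (!is.null(visited[[yi]])) next        # already tried on this path
      assign(yi, TRUE, envir = visited)
      holder <- matched[[yi]]                  # NULL if yi is still free
      if (is.null(holder) || try_assign(holder, visited)) {
        assign(yi, xi, envir = matched)
        return(TRUE)
      }
    }
    FALSE
  }

  for (xi in names(candidates)) try_assign(xi, new.env())

  # Translate the matching back into rows of the original data frame
  y_ids <- ls(matched, all.names = TRUE)
  x_ids <- vapply(y_ids, function(k) matched[[k]], character(1))
  keep <- paste(df[[x]], df[[y]]) %in% paste(x_ids, y_ids)
  out <- df[keep, , drop = FALSE]
  out[!duplicated(out[c(x, y)]), , drop = FALSE]
}

# Small case where the greedy two-step approach keeps only one row:
# id.x 1 can pair with 11 or 12, but id.x 2 can only pair with 11.
demo <- data.frame(id.x = c(1, 1, 2), id.y = c(11, 12, 11))
res <- max_unique_pairs(demo)
```

Here res keeps two rows (1 with 12, 2 with 11), whereas applying duplicated() to id.x and then id.y keeps only the first. The sketch ignores how close the times are; it only maximises the number of pairs. It is also recursive, so very long chains of re-assignments could run into R's nesting limits on large data, in which case a dedicated graph package may be a better fit.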
