
Finding matched pairs (or records) in a dataset

I have a huge dataset from which I need to draw matched samples based on some criteria. For example, for every movie star in a given location and borough, find two randomly chosen people who are not movie stars. The moviestar column is 1 for a movie star and 0 for a non-movie star.

 location <- c('manhattan', 'manhattan', 'manhattan', 'manhattan', 'manhattan', 'manhattan')
 moviestar <- c(0, 1, 0, 0, 0, 1)
 id <- c(1, 2, 3, 4, 5, 6)
 borough <- c('williamsburg', 'williamsburg', 'williamsburg', 'williamsburg', 'williamsburg', 'williamsburg')

 df <- data.frame(location, moviestar, borough)

I want to create a subset containing matched sets: each movie star together with two randomly picked non-movie stars living in the same location and borough. Any advice? In this example there are six people living in manhattan, two of whom (ids 2 and 6) are stars, so I would like one matched set for each of those two stars in the final data.

The output I am expecting looks like this:

  > subset
  location   moviestar  borough       id  matchpairid
  manhattan  1          williamsburg  2   match1
  manhattan  0          williamsburg  1   match1
  manhattan  0          williamsburg  5   match1
  manhattan  1          williamsburg  6   match2
  manhattan  0          williamsburg  3   match2
  manhattan  0          williamsburg  5   match2

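You can get close to this by counting the movie stars and non-movie stars in each group, then filtering within each group based on those counts. Note that this keeps the first rows of each group rather than a random sample, and it does not add a matchpairid column: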

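library(dplyr)

df %>%
  # count stars and non-stars within each location/borough
  group_by(location, borough) %>%
  mutate(num_movie_stars = sum(moviestar),
         num_non_movie_stars = sum(1 - moviestar)) %>%
  # row_number() now counts separately for stars and non-stars
  group_by(location, borough, moviestar) %>%
  # keep at most two non-stars per star, and at most one star per two non-stars
  filter(moviestar == 1 & row_number() <= num_non_movie_stars / 2 |
         moviestar == 0 & row_number() <= num_movie_stars * 2) %>%
  ungroup()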
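
To see what this filter keeps on the question's sample data, the same counts can be reproduced in base R. This is only a sketch: the id column is added so the kept rows can be identified, and n_star, n_nonstar, pos, keep and kept are illustrative names that do not appear in the answer above.

```r
# Sample data from the question (id added so kept rows can be identified)
df <- data.frame(
  id        = 1:6,
  location  = rep("manhattan", 6),
  moviestar = c(0, 1, 0, 0, 0, 1),
  borough   = rep("williamsburg", 6)
)

# Group-level counts, mirroring the mutate() step
n_star    <- ave(df$moviestar, df$location, df$borough, FUN = sum)
n_nonstar <- ave(1 - df$moviestar, df$location, df$borough, FUN = sum)

# Position of each row among the stars (or non-stars) of its group,
# mirroring row_number() after the second group_by()
pos <- ave(seq_along(df$moviestar),
           df$location, df$borough, df$moviestar,
           FUN = seq_along)

# Same condition as the filter() call
keep <- (df$moviestar == 1 & pos <= n_nonstar / 2) |
        (df$moviestar == 0 & pos <= n_star * 2)
kept <- df[keep, ]
```

With two stars and four non-stars the thresholds are 4 / 2 = 2 for stars and 2 * 2 = 4 for non-stars, so all six rows pass. The selection depends on row position only, so shuffling the rows beforehand would be needed to make it random.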

In data.table, you could do this with the following:

library(data.table)

setDT(df)[df[, keeper := max(moviestar) == 1, by=.(location, borough)][(keeper),
            if(any(moviestar == 0)) c(sample(.I[moviestar == 0], 2 * sum(moviestar)),
                                             .I[moviestar == 1]), by=.(location, borough)]$V1
          ][, keeper := NULL][]

    location moviestar      borough
1: manhattan         0 williamsburg
2: manhattan         0 williamsburg
3: manhattan         1 williamsburg

keeper is set to TRUE for every location/borough group that contains at least one movie star, and is then used to subset the data. The second j expression checks whether the group has any non-movie stars. If it does, it samples two non-movie-star rows for every movie star in the group (using the row indices in .I) and also includes the row indices of the movie stars themselves. $V1 extracts these indices, which are then fed back to the original dataset to pull out the matching rows.

keeper := NULL removes the intermediate keeper variable, and the trailing [] prints the result.
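
One caveat when sampling row indices this way: base R's sample() treats a single number x as the range 1:x, and it raises an error when asked for more elements than exist (with replace = FALSE). Both cases can occur if a group has only one non-movie star, or fewer than two per star. The sketch below shows the pitfall and one possible guard. safe_sample is a hypothetical helper, not part of data.table or base R.

```r
# Hypothetical helper: sample up to n elements of x by position, so that a
# length-one x is never expanded to 1:x and n is capped at length(x).
safe_sample <- function(x, n) {
  x[sample.int(length(x), min(n, length(x)))]
}

idx <- 7L                      # a group with a single candidate row index

set.seed(1)
risky <- sample(idx, 1)        # draws from 1:7, so it may not return 7
safe  <- safe_sample(idx, 2)   # returns 7; two were requested, one exists

several <- safe_sample(c(3L, 9L, 12L), 2)  # two distinct values from the vector
```

Capping n silently returns fewer controls than requested. Whether that is acceptable, or whether such groups should be dropped instead, depends on the matching design.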

And a simple answer that uses no packages:

starstruck <- function(location, borough, df){
  # restrict to the requested location and borough
  subsamp <- df[which(location == df$location & borough == df$borough),]
  stars <- subsamp[subsamp$moviestar == 1,]
  nostars <- subsamp[subsamp$moviestar == 0,]
  # draw one star and two non-stars at random, without replacement
  randomcombo <- rbind(stars[sample(nrow(stars), 1, FALSE),],
                       nostars[sample(nrow(nostars), 2, FALSE),])
  # sort the result by row name
  randomcombo[order(rownames(randomcombo)),]
}

starstruck("manhattan", "williamsburg", df)
#   location moviestar      borough
#1 manhattan         0 williamsburg
#2 manhattan         1 williamsburg
#3 manhattan         0 williamsburg
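
None of the snippets above adds the matchpairid column from the question. A minimal base R sketch of one way to build it follows, assuming the id column is included in the data frame and that a non-star may be reused across matched sets (as id 5 is in the expected output). match_stars and n_controls are hypothetical names introduced for illustration.

```r
# Sample data from the question, with the id column included
df <- data.frame(
  id        = 1:6,
  location  = rep("manhattan", 6),
  moviestar = c(0, 1, 0, 0, 0, 1),
  borough   = rep("williamsburg", 6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each star, draw n_controls non-stars from the
# same location and borough and tag the set with a shared matchpairid.
match_stars <- function(df, n_controls = 2) {
  star_rows <- which(df$moviestar == 1)
  sets <- lapply(seq_along(star_rows), function(k) {
    s <- star_rows[k]
    pool <- which(df$moviestar == 0 &
                  df$location == df$location[s] &
                  df$borough == df$borough[s])
    # skip stars that do not have enough candidate controls
    if (length(pool) < n_controls) return(NULL)
    # sample positions in the pool, which is safe even if the pool has one row
    picked <- pool[sample.int(length(pool), n_controls)]
    out <- df[c(s, picked), ]
    out$matchpairid <- paste0("match", k)
    out
  })
  result <- do.call(rbind, sets)
  if (!is.null(result)) rownames(result) <- NULL
  result
}

set.seed(42)
matched <- match_stars(df)
```

Each matched set starts with the star, followed by its sampled non-stars. Controls are drawn independently for each star, so the same person can appear in more than one set. Removing picked rows from the pool after each draw would prevent that.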
