I have a huge dataset from which I need to draw matched samples based on some criteria. For example, for every movie star in a given location and borough, find two randomly chosen people who are not movie stars. The moviestar column is 1 for a movie star and 0 for a non-movie star.
location <- c('manhattan', 'manhattan', 'manhattan', 'manhattan', 'manhattan', 'manhattan')
moviestar <- c(0, 1, 0, 0, 0, 1)
id <- c(1, 2, 3, 4, 5, 6)
borough <- c('williamsburg', 'williamsburg', 'williamsburg', 'williamsburg', 'williamsburg', 'williamsburg')
df <- data.frame(location, moviestar, borough)
I want to create a subset that contains matched sets: each movie star together with two randomly picked non-movie stars living in the same location and borough. Any advice? In this example there are six people living in manhattan, two of whom are stars (ids 2 and 6), so I would like the final data to contain one matched set per star.
The output I am expecting looks like this:
> subset
location  moviestar borough      id matchpairid
manhattan 1         williamsburg 2  match1
manhattan 0         williamsburg 1  match1
manhattan 0         williamsburg 5  match1
manhattan 1         williamsburg 6  match2
manhattan 0         williamsburg 3  match2
manhattan 0         williamsburg 5  match2
You can get this by counting the number of movie stars and non-movie stars per group, then filtering within each group based on that condition:
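library(dplyr)

df %>%
  # group by borough as well, since matches must share location and borough
  group_by(location, borough) %>%
  mutate(num_movie_stars = sum(moviestar),
         num_non_movie_stars = sum(1 - moviestar)) %>%
  group_by(location, borough, moviestar) %>%
  # keep at most num_non_movie_stars / 2 stars and num_movie_stars * 2 non-stars
  filter(moviestar & row_number() <= num_non_movie_stars / 2 |
           !moviestar & row_number() <= num_movie_stars * 2) %>%
  ungroup()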
In data.table, you could do this with the following:
library(data.table)
setDT(df)[df[, keeper := max(moviestar) == 1, by = .(location, borough)][
  (keeper),
  if (any(moviestar == 0)) c(sample(.I[moviestar == 0], 2 * sum(moviestar)),
                             .I[moviestar == 1]),
  by = .(location, borough)]$V1
][, keeper := NULL][]
location moviestar borough
1: manhattan 0 williamsburg
2: manhattan 0 williamsburg
3: manhattan 1 williamsburg
keeper is assigned TRUE in every location and borough group that contains at least one movie star, and is then used to subset the data. The second j statement checks whether the group has any non-movie stars. If it does, it samples two non-movie-star rows (using .I) for every movie star in the group and also includes the movie stars themselves. $V1 extracts these row indices, which are fed back into the original dataset to pull in the matching rows. keeper := NULL removes the intermediate keeper variable, and the [] at the end prints the result.
And a simple no-package answer:
starstruck <- function(location, borough, df) {
  # rows in the requested location and borough
  subsamp <- df[which(location == df$location & borough == df$borough), ]
  stars <- subsamp[subsamp$moviestar == 1, ]
  nostars <- subsamp[subsamp$moviestar == 0, ]
  # one random star plus two random non-stars, sampled without replacement
  randomcombo <- rbind(stars[sample(nrow(stars), 1, FALSE), ],
                       nostars[sample(nrow(nostars), 2, FALSE), ])
  randomcombo[order(rownames(randomcombo)), ]
}
starstruck("manhattan", "williamsburg", df)
# location moviestar borough
#1 manhattan 0 williamsburg
#2 manhattan 1 williamsburg
#3 manhattan 0 williamsburg
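The question also asks for one labelled matched set per star, which the function above does not produce because it draws a single star. A minimal base R sketch of that extension might look like the following. It assumes the data carry an id column as in the question's expected output. The names match_stars, people and n_controls are hypothetical and introduced only for illustration. Controls are drawn independently for each star, so the same non-star can appear in more than one set, as id 5 does in the expected output.

```r
# Sample data from the question, with the id column included (assumption).
people <- data.frame(
  id        = 1:6,
  location  = "manhattan",
  moviestar = c(0, 1, 0, 0, 0, 1),
  borough   = "williamsburg"
)

# Hypothetical helper: for every movie star, draw n_controls random non-stars
# from the same location and borough and tag the set with a match id.
match_stars <- function(df, n_controls = 2) {
  star_rows <- which(df$moviestar == 1)
  sets <- lapply(seq_along(star_rows), function(k) {
    s <- star_rows[k]
    # candidate controls: non-stars sharing the star's location and borough
    pool <- which(df$moviestar == 0 &
                    df$location == df$location[s] &
                    df$borough == df$borough[s])
    if (length(pool) < n_controls) return(NULL)  # too few controls: skip star
    # index into pool so a length-one pool is not expanded to 1:pool by sample()
    picked <- pool[sample.int(length(pool), n_controls)]
    out <- df[c(s, picked), ]
    out$matchpairid <- paste0("match", k)
    out
  })
  result <- do.call(rbind, Filter(Negate(is.null), sets))
  rownames(result) <- NULL
  result
}

set.seed(1)  # only to make the random draw reproducible
matched <- match_stars(people)
matched
```

Each set holds the star first, followed by its two sampled non-stars. If controls must not be reused across stars, the pool would need to shrink after each draw, which this sketch does not attempt.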