简体   繁体   中英

Sampling from a subset of a dataframe where the subset is conditional on a value from another dataframe in R

I have two data frames in R . One contains a row for each individual person and the area they live in. Eg

df1 = data.frame(Person_ID = seq(1,10,1), Area = c("A","A","A","B","B","C","D","A","D","C"))

The other data frame contains demographic information for each Area .

Eg for gender df2 = data.frame(Area = c("A","A","B","B","C","C","D","D"), gender = c("M","F","M","F","M","F","M","F"), probability = c(0.4,0.6,0.55,0.45,0.6,0.4,0.5,0.5))

In df1 I want to create a gender column where for each row of df1 I sample a gender from the appropriate subset of df2 . For example, for row 1 of df1 I would sample a gender from df2 %>% filter(Area == "A")

The question is how do I do this for all rows without a for loop as in practice df1 could have up to 5 million rows?

Try using the following:

library(dplyr)
library(tidyr)

out <- df1 %>%
         nest(data = -Area) %>%
         left_join(df2, by = 'Area') %>%
         group_by(Area) %>%
         summarise(data = map(data, ~.x %>% 
                                      mutate(gender = sample(gender, n(), 
                                      prob = probability, replace = TRUE)))) %>%
         distinct(Area, .keep_all = TRUE) %>%
         unnest(data)

We first nest df1 and join it with df2 by Area . For each Area we sample gender value based on probability in df2 and unnest to get long dataframe.

There are not enough samples in df1 to verify the result but if we increase number of rows in df1 the proportion should be similar to probability in df2 .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM