
Merge 2 data sets in long format based on a condition

I have 2 data frames that I would like to merge. The data sets differ in the number of observations and in the way they were collected. In df1, observations were recorded on 2 different days. Each record has an index, id1 (the person identification number) and id2 (the number of the day on which the recording was made; the two days had to be different). There is also a Day variable that records the weekday on which the recording was made.

In df2, observations were recorded based only on a serial number and id1 (the person identification number). There is just one observation per person. Here, too, there is a Day variable, which records the day on which the recordings started.

I would like to identify the observations from df2 that were recorded on the same day as in df1.

I tried to create a new index (grouping index and id1) so that I could reshape to long format and merge based on days.

df1: Day denotes when the observations were made (e.g. for index 12, id1 1 denotes a single person and id2 denotes the 2 days: Wednesday is id2 1 and Sunday is id2 2)

    index id1 id2  Day         obs1 obs2 obs3
     12    1   1   Wednesday    1    11   12
     12    1   2   Sunday       2     0    0
    123    1   1   Tuesday      1     0    1
    123    1   2   Saturday     3     0    3
    123    2   1   Monday       2     2    4
    123    2   2   Saturday     1     0    8

df2: here the Day variable denotes the starting day from which the observations were made (e.g. id 12 day2 and id 123 day1)

    index  id1  Day      day1 day2 day3 day4 day5 day6 day7
     12     1   Tuesday    2    1    2    1    1    3    1
    123     1   Friday     0    3    0    3    3    0    3

Outcome:

    index id1 id2  obs1 obs2 obs3
     12    1   1    1    11   12
     12    1   2    2     0    0
    123    1   2    3     0    3
    123    2   2    1     0    8

Sample data

df1:

structure(list(index = c(12, 12, 123, 123, 123, 123), id1 = c(1, 
1, 1, 1, 2, 2), id2 = c(1, 2, 1, 2, 1, 2), Day = structure(c(5L, 
3L, 4L, 2L, 1L, 2L), .Label = c("Monday", "Saturday", "Sunday", 
"Tuesday", "Wednesday"), class = "factor"), obs1 = c(1, 2, 1, 
3, 2, 1), obs2 = c(11, 0, 0, 0, 2, 0), obs3 = c(12, 0, 1, 3, 
4, 8)), class = "data.frame", row.names = c(NA, -6L))

df2:

structure(list(index = c(12, 123), id1 = c(1, 1), Day = structure(2:1, .Label = c("Friday", 
"Tuesday"), class = "factor"), day1 = c(2, 0), day2 = c(1, 3), 
    day3 = c(2, 0), day4 = c(1, 3), day5 = c(1, 3), day6 = c(3, 
    0), day7 = c(1, 3)), class = "data.frame", row.names = c(NA, 
-2L))

We can get df2 into long format, group_by index, keep the rows from the day the observations started, and join the result with df1 by index, filtering on Day.

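library(dplyr)

# Weekday names in calendar order; match() against this gives positions 1-7
weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
             "Saturday", "Sunday")

df2 %>%
  # make sure the day1..day7 columns are numeric before reshaping
  mutate(across(matches('day\\d+'), as.numeric)) %>%
  # long format: one row per index and day column
  tidyr::pivot_longer(cols = matches('day\\d+')) %>%
  group_by(index) %>%
  # within each index, keep the day rows from the starting weekday onwards
  filter(row_number() >= match(Day, weekday)[1L]) %>%
  # one row per index holding the starting weekday as a number
  summarise(Day = match(Day, weekday)[1]) %>%
  # attach df1 with its Day as a number; Day.x is from df2, Day.y from df1
  inner_join(df1 %>% mutate(Day = match(Day, weekday)), by = 'index') %>%
  # keep the df1 rows recorded on or after the df2 starting day
  filter(Day.y >= Day.x)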


#  index Day.x   id1   id2 Day.y  obs1  obs2  obs3
#  <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#1    12     2     1     1     3     1    11    12
#2    12     2     1     2     7     2     0     0
#3   123     5     1     2     6     3     0     3
#4   123     5     2     2     6     1     0     8
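
As a minimal base-R sketch of the comparison used above (not part of the original answer): match() turns each weekday name into its position in weekday, so the final filter is a plain numeric comparison. The names df1_days, start_day, df1_pos, start_pos and keep are made up for illustration; the values are the index 123 rows from the sample data.

```r
weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
             "Saturday", "Sunday")

# Hypothetical helper vectors: the df1 days and the df2 starting day for index 123
df1_days  <- c("Tuesday", "Saturday", "Monday", "Saturday")
start_day <- "Friday"

df1_pos   <- match(df1_days, weekday)   # weekday names -> positions 1-7
start_pos <- match(start_day, weekday)

# TRUE where the df1 recording falls on or after the df2 starting day
keep <- df1_pos >= start_pos
print(data.frame(df1_days, df1_pos, start_pos, keep))
```

Only the two Saturday rows are flagged, which agrees with the index 123 rows in the output above.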

You can then use select to keep only the columns that are required.
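
A short sketch of that last step, assuming the joined result has the columns shown in the output above. The object names joined and outcome are hypothetical, and joined is typed in by hand here so the example stands alone.

```r
library(dplyr)

# Hand-built copy of the joined result printed above (illustration only)
joined <- data.frame(
  index = c(12, 12, 123, 123),
  Day.x = c(2L, 2L, 5L, 5L),
  id1   = c(1, 1, 1, 2),
  id2   = c(1, 2, 2, 2),
  Day.y = c(3L, 7L, 6L, 6L),
  obs1  = c(1, 2, 3, 1),
  obs2  = c(11, 0, 0, 0),
  obs3  = c(12, 0, 3, 8)
)

# Keep only the columns from the desired outcome, dropping the two Day helpers
outcome <- joined %>%
  select(index, id1, id2, obs1, obs2, obs3)
print(outcome)
```

Writing select(-Day.x, -Day.y) instead would drop the same two columns without listing the rest.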

An option with melt from data.table

library(data.table)
weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

If Day in the datasets is a haven labelled column, we first convert it to a factor with as_factor

library(haven)
df1$Day <- as.character(as_factor(df1$Day))
df2$Day <- as.character(as_factor(df2$Day))
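# convert df1's weekday names to positions 1-7
df1$Day <- match(df1$Day, weekday)
# melt to long format, filter rows by position, then keep one row per index
# holding the starting weekday as a number
dt2 <- melt(setDT(df2), measure = patterns('^day\\d+$'))[
    seq_len(.N) >= match(Day, weekday)[1L]
  ][, .(Day = match(Day, weekday)[1]), index]
# Day.x comes from df1, Day.y from df2: keep df1 rows on or after the start day
merge(setDT(df1), dt2, by = 'index')[Day.y <= Day.x]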
#   index id1 id2 Day.x obs1 obs2 obs3 Day.y
#1:    12   1   1     3    1   11   12     2
#2:    12   1   2     7    2    0    0     2
#3:   123   1   2     6    3    0    3     5
#4:   123   2   2     6    1    0    8     5
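
A small sketch of what the as_factor step does, assuming Day was imported as a haven labelled vector (for example from an SPSS or Stata file). The coding below (1 = Monday, 2 = Tuesday) and the names day_lab and day_chr are invented for illustration and are not taken from the question's data.

```r
library(haven)

# Hypothetical labelled vector: numeric codes with weekday labels attached
day_lab <- labelled(c(2, 1, 2), labels = c(Monday = 1, Tuesday = 2))

# as_factor() swaps the codes for their labels; as.character() then gives
# plain weekday names that match() can look up in the weekday vector
day_chr <- as.character(as_factor(day_lab))
print(day_chr)
```

If Day is already a plain factor or character column, as in the sample data, this conversion changes nothing of substance.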

Or, with tidyverse, it is better to return a list column in summarise and then unnest it (in case the lengths do not match the number of rows)

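library(dplyr)
library(tidyr)
# run this on the original df1 and df2: df1$Day must still hold weekday names,
# not the positions assigned in the data.table code above
df2 %>%
  pivot_longer(cols = day1:day7) %>%
  group_by(index) %>%
  slice(match(Day, weekday)[1L]:n()) %>%
  summarise(Day = match(Day, weekday)[1]) %>%
  inner_join(df1 %>%
               mutate(Day = match(Day, weekday)), by = 'index') %>%
  filter(Day.y >= Day.x)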
# A tibble: 4 x 8
#  index Day.x   id1   id2 Day.y  obs1  obs2  obs3
#  <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#1    12     2     1     1     3     1    11    12
#2    12     2     1     2     7     2     0     0
#3   123     5     1     2     6     3     0     3
#4   123     5     2     2     6     1     0     8
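
The list-column pattern mentioned above does not appear in the pipeline itself, so here is a rough sketch of what it might look like, assuming the goal is to keep, for each index, the day values from the starting weekday onwards. The column name kept_value and the object name kept are hypothetical, and df2 is rebuilt here with Day as plain character.

```r
library(dplyr)
library(tidyr)

weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
             "Saturday", "Sunday")

df2 <- data.frame(
  index = c(12, 123), id1 = c(1, 1), Day = c("Tuesday", "Friday"),
  day1 = c(2, 0), day2 = c(1, 3), day3 = c(2, 0), day4 = c(1, 3),
  day5 = c(1, 3), day6 = c(3, 0), day7 = c(1, 3)
)

kept <- df2 %>%
  pivot_longer(cols = day1:day7) %>%
  group_by(index) %>%
  # each group returns a vector of a different length, so wrap it in a list
  summarise(kept_value = list(
    value[seq_along(value) >= match(Day[1], weekday)]
  )) %>%
  # expand the list column back into one row per kept value
  unnest(kept_value)
print(kept)
```

Index 12 starts on Tuesday and keeps six values, while index 123 starts on Friday and keeps three, which is the unequal-length case that the list column is meant to handle.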

