I have 2 data frames that I would like to merge. The data sets differ in the number of observations and in the way they were collected.
In df1, observations were recorded on 2 different days. Each record has an index, id1 (the person identification number) and id2 (the number of the day on which the recording was made; the two days had to be different). There is also a Day variable that records the weekday on which the recording was made.
In df2, observations were recorded based only on the serial number and id1, the person identification number. There is just one observation per person. Here too there is a Day variable, which records the day on which the recordings started.
I would like to identify the observations from df2 that were recorded on the same day as in df1.
I tried to create a new index (grouping index and id1), reshape to long format, and merge based on days.
df1: Day denotes when the observations were made (e.g. for index 12, id1 denotes just 1 person and id2 denotes the 2 days: Wednesday is id2 1 and Sunday is id2 2)
index id1 id2 Day       obs1 obs2 obs3
12    1   1   Wednesday 1    11   12
12    1   2   Sunday    2    0    0
123   1   1   Tuesday   1    0    1
123   1   2   Saturday  3    0    3
123   2   1   Monday    2    2    4
123   2   2   Saturday  1    0    8
df2: here the Day variable denotes the starting day from which the observations were made (e.g. id 12 day2 and id 123 day1)
index id1 Day     day1 day2 day3 day4 day5 day6 day7
12    1   Tuesday 2    1    2    1    1    3    1
123   1   Friday  0    3    0    3    3    0    3
Outcome:
index id1 id2 obs1 obs2 obs3
12    1   1   1    11   12
12    1   2   2    0    0
123   1   2   3    0    3
123   2   2   1    0    8
Sample data
df1:
structure(list(index = c(12, 12, 123, 123, 123, 123), id1 = c(1,
1, 1, 1, 2, 2), id2 = c(1, 2, 1, 2, 1, 2), Day = structure(c(5L,
3L, 4L, 2L, 1L, 2L), .Label = c("Monday", "Saturday", "Sunday",
"Tuesday", "Wednesday"), class = "factor"), obs1 = c(1, 2, 1,
3, 2, 1), obs2 = c(11, 0, 0, 0, 2, 0), obs3 = c(12, 0, 1, 3,
4, 8)), class = "data.frame", row.names = c(NA, -6L))
df2:
structure(list(index = c(12, 123), id1 = c(1, 1), Day = structure(2:1, .Label = c("Friday",
"Tuesday"), class = "factor"), day1 = c(2, 0), day2 = c(1, 3),
day3 = c(2, 0), day4 = c(1, 3), day5 = c(1, 3), day6 = c(3,
0), day7 = c(1, 3)), class = "data.frame", row.names = c(NA,
-2L))
We can get df2 into long format, group_by index, keep the rows which occurred after the observations were made, and join it with df1 based on index and Day.
library(dplyr)
weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
"Saturday", "Sunday")
df2 %>%
  mutate_at(vars(matches('day\\d+')), as.numeric) %>%
  tidyr::pivot_longer(cols = matches('day\\d+')) %>%
  group_by(index) %>%
  filter(row_number() >= match(Day, weekday)[1L]) %>%
  summarise(Day = match(Day, weekday)[1]) %>%
  inner_join(df1 %>% mutate(Day = match(Day, weekday)), by = 'index') %>%
  filter(Day.y >= Day.x)
# index Day.x id1 id2 Day.y obs1 obs2 obs3
# <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#1 12 2 1 1 3 1 11 12
#2 12 2 1 2 7 2 0 0
#3 123 5 1 2 6 3 0 3
#4 123 5 2 2 6 1 0 8
You can then use select to keep only the columns that are required.
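As a minimal sketch of that last step, the snippet below rebuilds the joined table on a cut-down copy of the sample data and then keeps only the columns shown in the expected outcome. The names `joined` and `outcome` are illustrative, and the day1 to day7 columns of df2 are left out on the assumption that only the start day matters for the comparison.

```r
library(dplyr)

weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
             "Saturday", "Sunday")

# Cut-down copies of the sample data (df2 without its day1..day7 columns)
df1 <- data.frame(
  index = c(12, 12, 123, 123, 123, 123),
  id1   = c(1, 1, 1, 1, 2, 2),
  id2   = c(1, 2, 1, 2, 1, 2),
  Day   = c("Wednesday", "Sunday", "Tuesday", "Saturday", "Monday", "Saturday"),
  obs1  = c(1, 2, 1, 3, 2, 1),
  obs2  = c(11, 0, 0, 0, 2, 0),
  obs3  = c(12, 0, 1, 3, 4, 8)
)
df2 <- data.frame(index = c(12, 123), id1 = c(1, 1),
                  Day = c("Tuesday", "Friday"))

# One start-day number per index, joined to df1 and filtered as above
joined <- df2 %>%
  group_by(index) %>%
  summarise(Day = match(Day, weekday)[1]) %>%
  inner_join(df1 %>% mutate(Day = match(Day, weekday)), by = "index") %>%
  filter(Day.y >= Day.x)

# Drop the helper day columns and keep the requested layout
outcome <- joined %>%
  select(index, id1, id2, obs1, obs2, obs3)

print(outcome)
```

Listing the columns explicitly also fixes their order; `select(-Day.x, -Day.y)` would be an alternative if only the two helper columns should go.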
An option with melt from data.table:
library(data.table)
weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
If the 'Day' columns in the datasets are haven labelled, we first convert them to factor with as_factor
library(haven)
df1$Day <- as.character(as_factor(df1$Day))
df2$Day <- as.character(as_factor(df2$Day))
df1$Day <- match(df1$Day, weekday)
dt2 <- melt(setDT(df2), measure.vars = patterns('^day\\d+$'))[
  seq_len(.N) >= match(Day, weekday)[1L]][
  , .(Day = match(Day, weekday)[1]), index]
# Day.x is df1's day, Day.y is df2's start day; <= also keeps same-day rows
merge(setDT(df1), dt2, by = 'index')[Day.y <= Day.x]
# index id1 id2 Day.x obs1 obs2 obs3 Day.y
#1: 12 1 1 3 1 11 12 2
#2: 12 1 2 7 2 0 0 2
#3: 123 1 2 6 3 0 3 5
#4: 123 2 2 6 1 0 8 5
Or using tidyverse, it is better to return a list column in summarise and then unnest (in case the lengths do not match the number of rows):
library(dplyr)
library(tidyr)
df2 %>%
  pivot_longer(cols = day1:day7) %>%
  group_by(index) %>%
  slice(match(Day, weekday)[1L]:n()) %>%
  summarise(Day = match(Day, weekday)[1]) %>%
  inner_join(df1 %>%
               mutate(Day = match(Day, weekday)), by = 'index') %>%
  filter(Day.y >= Day.x)
# A tibble: 4 x 8
# index Day.x id1 id2 Day.y obs1 obs2 obs3
# <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#1 12 2 1 1 3 1 11 12
#2 12 2 1 2 7 2 0 0
#3 123 5 1 2 6 3 0 3
#4 123 5 2 2 6 1 0 8
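The list-column idea mentioned above is not spelled out in the code, so here is one possible sketch of it. It assumes that a df1 row qualifies when its weekday number lies between the df2 start day and Sunday, which is what the `Day.y >= Day.x` filter amounts to. The name `valid_days` is illustrative, and the day1 to day7 columns of df2 are omitted because this variant does not need them.

```r
library(dplyr)
library(tidyr)

weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
             "Saturday", "Sunday")

# Cut-down copies of the sample data (df2 without its day1..day7 columns)
df1 <- data.frame(
  index = c(12, 12, 123, 123, 123, 123),
  id1   = c(1, 1, 1, 1, 2, 2),
  id2   = c(1, 2, 1, 2, 1, 2),
  Day   = c("Wednesday", "Sunday", "Tuesday", "Saturday", "Monday", "Saturday"),
  obs1  = c(1, 2, 1, 3, 2, 1),
  obs2  = c(11, 0, 0, 0, 2, 0),
  obs3  = c(12, 0, 1, 3, 4, 8)
)
df2 <- data.frame(index = c(12, 123), id1 = c(1, 1),
                  Day = c("Tuesday", "Friday"))

# For each index, store every admissible weekday number in a list column,
# then unnest so that each admissible day becomes its own row
valid_days <- df2 %>%
  group_by(index) %>%
  summarise(Day = list(seq(match(Day, weekday)[1], length(weekday)))) %>%
  unnest(Day)

# Join on both keys; no post-join filter is needed
result <- df1 %>%
  mutate(Day = match(Day, weekday)) %>%
  inner_join(valid_days, by = c("index", "Day"))

print(result)
```

Because the join uses both index and Day, the result has a single Day column instead of Day.x and Day.y, and the rows come back in df1 order.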