
Merge data sets based on group identifiers

I have 2 data sets with 2 different types of observations. The observations were made on different days and recorded at different time intervals.

Both data sets have a serial number that identifies the group of people who made the observations. For example, serial 111 denotes one group. The number of people in a group varies; group 111, for example, consists of 3 people. Within a group, a person is identified by the id1 variable, so serial 111 and id1 2 means that the observation was made by person number two from group 111. There is also a Day variable that gives the weekday on which the observation was made. It takes values from 1 (Monday) to 7 (Sunday).

In df1 there is one observation per person, whereas in df2 each person had to make two observations. In df2 an observation is identified by serial, id1 and id2, where id2 distinguishes between a person's two observation days. For example, serial 111, id1 3 and id2 2 is the second-day observation made by person number 3 from group 111. The weekday of the observation is again stored in the Day variable.

I want to identify the people who recorded information on the same day in both diaries, that is, who filled in both records on the same day. The problem is that df2 has two observations per person and df1 only one, which makes merging difficult.

I merged on serial and id1, but together they are not a unique identifier in df2. I also tried to create a new index variable (code below) so that I could merge at the Day level.

How can I merge the 2 data sets at the day level?

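library(dplyr)

# Number each person (serial, id1) in df1.
# cur_group_id() replaces the deprecated group_indices_().
df1 <- df1 %>%
  group_by(serial, id1) %>%
  mutate(index = cur_group_id()) %>%
  ungroup()

# Number each person-day (serial, id1, id2) in df2.
df2 <- df2 %>%
  group_by(serial, id1, id2) %>%
  mutate(index = cur_group_id()) %>%
  ungroup()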

Sample data:

df1

structure(list(serial = c(12, 123, 123, 10, 10), id1 = c(1, 1, 
2, 1, 2), Day = c(1, 3, 2, 4, 2)), class = "data.frame", row.names = c(NA, 
-5L))

df2

structure(list(serial = c(12, 12, 123, 123, 123, 123, 10, 10, 
10, 10, 10, 10), id1 = c(1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 3), 
    id2 = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2), Day = c(1, 6, 
    3, 7, 2, 7, 4, 7, 2, 7, 4, 7), index = c(7L, 8L, 9L, 10L, 
    11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L)), row.names = c(NA, -12L
), class = "data.frame")

Desired output:

serial id1 id2 Day
12      1   1   1
123     1   1   3
123     2   1   2
10      1   1   4
10      2   1   2

You can add the corresponding id2 value from df2 to df1 with an update join using data.table:

library(data.table)
setDT(df1)
setDT(df2)

df1[df2, id2 := i.id2, on = .(serial, id1, Day)]

df1
#    serial id1 Day id2
# 1:     12   1   1   1
# 2:    123   1   3   1
# 3:    123   2   2   1
# 4:     10   1   4   1
# 5:     10   2   2   1
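
Since the question already loads dplyr, the same idea can be sketched there as an inner join on serial, id1 and Day. This is only an illustration built on the sample data from the question; the name same_day is made up for the example, and the index column of df2 is left out for brevity.

```r
library(dplyr)

# Sample data from the question (df2 without its index column)
df1 <- data.frame(serial = c(12, 123, 123, 10, 10),
                  id1    = c(1, 1, 2, 1, 2),
                  Day    = c(1, 3, 2, 4, 2))
df2 <- data.frame(serial = c(12, 12, 123, 123, 123, 123, 10, 10, 10, 10, 10, 10),
                  id1    = c(1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 3),
                  id2    = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
                  Day    = c(1, 6, 3, 7, 2, 7, 4, 7, 2, 7, 4, 7))

# Keep only the person-days that appear in both data sets
same_day <- df1 %>%
  inner_join(df2, by = c("serial", "id1", "Day")) %>%
  select(serial, id1, id2, Day)

same_day
```

Unlike the update join, an inner join drops any df1 row that has no matching day in df2 instead of leaving id2 as NA.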

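You can try merge() as below. With no by argument it joins on all columns the two data frames share (serial, id1 and Day):

merge(df1, df2, all.x = TRUE)[1:4]

such that

> merge(df1, df2, all.x = TRUE)[1:4]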
  serial id1 Day id2
1     10   1   4   1
2     10   2   2   1
3     12   1   1   1
4    123   1   3   1
5    123   2   2   1
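
A short sketch of what all.x = TRUE changes, in case some people in df1 have no same-day record in df2. The extra row for serial 10, id1 3 on Day 5 is hypothetical and not part of the question's data; it is added only to produce a non-matching case.

```r
# Sample data from the question (df2 without its index column)
df1 <- data.frame(serial = c(12, 123, 123, 10, 10),
                  id1    = c(1, 1, 2, 1, 2),
                  Day    = c(1, 3, 2, 4, 2))
df2 <- data.frame(serial = c(12, 12, 123, 123, 123, 123, 10, 10, 10, 10, 10, 10),
                  id1    = c(1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 3),
                  id2    = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
                  Day    = c(1, 6, 3, 7, 2, 7, 4, 7, 2, 7, 4, 7))

# Hypothetical extra person in df1 whose day (5) does not occur in df2
df1_extra <- rbind(df1, data.frame(serial = 10, id1 = 3, Day = 5))

# Left join: the unmatched person is kept, with id2 set to NA
left_res <- merge(df1_extra, df2, all.x = TRUE)

# Inner join (the default): the unmatched person is dropped
inner_res <- merge(df1_extra, df2)

left_res
inner_res
```

If the goal is only to list the people who filled in both diaries on the same day, the inner join, or filtering the left join on !is.na(id2), is probably the closer fit.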

Use merge: out <- merge(df1, df2, by = c('serial', 'id1', 'Day')) and then select the columns serial, id1, id2, Day. Including Day in by keeps only the rows recorded on the same day in both data sets.
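
A minimal sketch of what happens when Day is left out of the merge keys, using the sample data from the question: merge() then returns every df2 row for each person and keeps both day columns as Day.x and Day.y, so the same-day rows have to be filtered afterwards. The names both, same and out are illustrative only.

```r
# Sample data from the question (df2 without its index column)
df1 <- data.frame(serial = c(12, 123, 123, 10, 10),
                  id1    = c(1, 1, 2, 1, 2),
                  Day    = c(1, 3, 2, 4, 2))
df2 <- data.frame(serial = c(12, 12, 123, 123, 123, 123, 10, 10, 10, 10, 10, 10),
                  id1    = c(1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 3),
                  id2    = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
                  Day    = c(1, 6, 3, 7, 2, 7, 4, 7, 2, 7, 4, 7))

# Merge on person only: two rows per person, Day split into Day.x / Day.y
both <- merge(df1, df2, by = c("serial", "id1"))

# Keep the rows where the df1 day equals the df2 day
same <- both[both$Day.x == both$Day.y, ]

# Tidy up to the columns asked for in the question
out <- data.frame(serial = same$serial, id1 = same$id1,
                  id2 = same$id2, Day = same$Day.x)
out
```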

