
split join data.table R

Objective

Join DT1 (as i in data.table) to DT2 on the given key column(s), separately within each group of DT2 defined by the Date column.

I do not think I can simply run DT2[DT1, on = 'key'], because the key column is repeated across dates, although it is unique within any single date.
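To make that key structure concrete, here is a minimal sketch on a small hypothetical table (toy values, not the data generated below). It uses anyDuplicated() with the by argument to show that Segment repeats across the table as a whole but not within a Date.

```r
library(data.table)

# Hypothetical toy version of DT2: S1 appears on two different dates
DT2 <- data.table(
  Date    = as.Date(c('2018-01-01', '2018-01-01', '2018-01-02')),
  Segment = c('S1', 'S2', 'S1'),
  Total   = c(1.5, 1.2, 1.8)
)

# Index of the first duplicated Segment over the whole table (0 means none)
dup_overall <- anyDuplicated(DT2, by = 'Segment')

# Index of the first duplicated Date/Segment pair (0 means none)
dup_within_date <- anyDuplicated(DT2, by = c('Date', 'Segment'))
```

Here dup_overall is non-zero while dup_within_date is 0, which is the situation described above.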

Reproducible example with a working solution

DT3 is my expected output. Is there a way to achieve this without the split manoeuvre, which does not feel very data.table-y?

library(data.table)
set.seed(1)
DT1 <- data.table(
  Segment  = sample(paste0('S', 1:10), 100, TRUE), 
  Activity = sample(paste0('A', 1:5), 100, TRUE), 
  Value    = runif(100)
)
dates <- seq(as.Date('2018-01-01'), as.Date('2018-11-30'), by = '1 day')
DT2 <- data.table(
  Date    = rep(dates, each = 5), 
  Segment = sample(paste0('S', 1:10), 3340, TRUE), 
  Total   = runif(3340, 1, 2)
)
rm(dates)
# To ensure that each Date Segment combination is unique
DT2 <- unique(DT2, by = c('Date', 'Segment'))
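# Split DT2 into one table per Date, join DT1 to each piece on Segment,
# then stack the per-date results back together
iDT2 <- split(DT2, by = 'Date')
iDT2 <- lapply(
  iDT2,
  function(x) {
    x[DT1, on = 'Segment', nomatch = 0]  # inner join: drop unmatched rows of DT1
  }
)
DT3 <- rbindlist(iDT2, use.names = TRUE)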

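You can achieve the same result with a cartesian merge. Each per-date join matches on Segment alone, so stacking the per-date results gives the same rows as joining all of DT2 to DT1 on Segment in one step; allow.cartesian = TRUE permits the resulting many-to-many match: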

DT4 <- merge(DT2, DT1, by = 'Segment', allow.cartesian = TRUE)

The following check confirms that both results contain the same rows once they are sorted the same way:

all(DT3[order(Segment, Date, Total, Activity, Value),
        c('Segment', 'Date', 'Total', 'Activity', 'Value')] ==
    DT4[order(Segment, Date, Total, Activity, Value),
        c('Segment', 'Date', 'Total', 'Activity', 'Value')])

[1] TRUE
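
For comparison, the same inner join can probably also be written in data.table's own x[i] syntax instead of merge(). The block below is a minimal sketch on small hypothetical toy tables (not the random data from the question); the names by_date, DT5 and same are illustrative only. It rebuilds the split-based result and checks it against a single join on Segment.

```r
library(data.table)

# Hypothetical toy tables with the same column layout as in the question
DT1 <- data.table(
  Segment  = c('S1', 'S1', 'S2'),
  Activity = c('A1', 'A2', 'A1'),
  Value    = c(0.1, 0.2, 0.3)
)
DT2 <- data.table(
  Date    = as.Date(c('2018-01-01', '2018-01-01', '2018-01-02')),
  Segment = c('S1', 'S2', 'S1'),
  Total   = c(1.5, 1.2, 1.8)
)

# Split-based approach from the question: one inner join per Date
by_date <- lapply(
  split(DT2, by = 'Date'),
  function(x) x[DT1, on = 'Segment', nomatch = 0]
)
DT3 <- rbindlist(by_date, use.names = TRUE)

# Single inner join on Segment; allow.cartesian = TRUE permits a result
# larger than nrow(DT2) + nrow(DT1), which the question's data would need
DT5 <- DT2[DT1, on = 'Segment', nomatch = 0, allow.cartesian = TRUE]

# Put both results in the same row order before comparing them
cols <- c('Date', 'Segment', 'Activity')
setorderv(DT3, cols)
setorderv(DT5, cols)
same <- isTRUE(all.equal(DT3, DT5, check.attributes = FALSE))
```

The column order of DT5 (DT2's columns first, then the non-join columns of DT1) matches the split-based result, so the two tables can be compared directly after sorting.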
