
ffbase: merge on columns X and Y and closest column Z

I would like to do the following with an ffdf: merge on columns X and Y exactly, and on the closest value of a third column (time in the example below). However, the procedure I know from smaller samples relies on outer merges (as shown below). Is there a way around this with ffbase for a large sample that won't fit in memory (and probably wouldn't work with sqldf)? If it is not possible, what would be the best library for this?

A reproducible example, using the same data as the in-memory version further down:

library(ffbase)  # depends on ff, which provides as.ffdf()
set.seed(1)
df.ff <- as.ffdf(cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30)))

# ff has no character type, so val is stored as a factor here
to.merge.ff <- as.ffdf(data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4), time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5], stringsAsFactors = TRUE))

I borrow the following example from @ChinmayPatil (R - merge dataframes on matching A, B and *closest* C?) to show the in-memory procedure I would like to reproduce:

library(data.table)
set.seed(1)
df <- setDT(cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30)))

to.merge <- setDT(data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4), time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5], stringsAsFactors = FALSE))

## First do a left outer merge on the exact-match columns
A <- merge(to.merge, df, by = c('x', 'y'), all.x = TRUE)

## Then compute the absolute time difference for every candidate pair
A$diff <- abs(A$time.x - A$time.y)

## Then take the minimum distance within each (x, y) group
## (.I gives the row index of that minimum, returned in column V1)
A[ , .I[which.min(diff)] , by = c('x', 'y' ) ]

Given that my question got so few views and no answers, I will describe the approach I came up with, in the hope that someone finds it useful (or as a reminder for my future self):

To me, the most difficult aspect of matching exactly on one column and then on the nearest value of another was that I kept thinking an outer join (as described in the post) was necessary. It is not: a rolling join in data.table finds the nearest match directly, and ffdfdply can apply it to the large object one chunk at a time. For the purpose of illustration, assume there is one large ffdf object and one regular data.table that fits in memory, so every chunk can be matched against the whole of the small table:

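library(ffbase)
library(data.table)

### Large ffdf object (11 dates; ff has no character type, so letters.A is a factor)
A <- as.ffdf(data.frame( dates.A = seq.Date(as.Date('2008-01-01'),as.Date('2008-01-31'), by = '3 days'), 
                     letters.A = factor(rep(LETTERS[1:4], length.out = 11)), value.A = runif(11) ))

### Small data.table that fits in memory (5 dates; letters recycled explicitly)
B <- data.table( date.B = seq.Date(as.Date('2008-01-01'),as.Date('2008-01-05'), by = 'days'), 
                 letters.B = rep(LETTERS[1:4], length.out = 5), value.B = runif(5) )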

Then you can simply define a function that does the merging using data.table and roll = 'nearest':

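merge.ff <- function(x){
  # x is one in-memory chunk of A, handed over by ffdfdply as a data.frame
  setDT(x)
  # Copy the join columns under common names so both tables share the same keys
  x[, ':=' (dates.merge = dates.A, letters.merge = letters.A)]
  B[, ':=' (dates.merge = date.B, letters.merge = letters.B)]
  # Exact match on letters.merge, rolling (nearest) match on the last key, dates.merge
  setkeyv(x, c('letters.merge','dates.merge'))
  setkeyv(B, c('letters.merge','dates.merge'))

  # For every row of x, pick the row of B with the nearest date; ffdfdply expects a data.frame back
  as.data.frame(B[x, roll = 'nearest'])
}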

and apply it to A:

result <- ffdfdply( A, split = A$dates.A, FUN = merge.ff)
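
As a sanity check on why processing the chunks separately is safe here, the sketch below mimics the chunk-wise idea purely in memory with data.table (no ff involved). The toy tables, the nearest_join helper and the chunk_id assignment are all made up for illustration and are not part of ffbase. The only point is that, because the small table is fully available to every chunk, joining chunk by chunk should give the same matches as joining everything at once.

```r
library(data.table)

# Hypothetical small lookup table (plays the role of B)
B_toy <- data.table(
  key.B   = c("A", "A", "B", "B"),
  date.B  = as.Date("2008-01-01") + c(0, 10, 0, 10),
  value.B = 1:4
)

# Hypothetical "large" table (plays the role of A), kept tiny here
A_toy <- data.table(
  key.A   = rep(c("A", "B"), each = 3),
  date.A  = as.Date("2008-01-01") + c(1, 4, 9, 2, 6, 12),
  value.A = 101:106
)

# Exact match on the key column, nearest match on the date column
nearest_join <- function(chunk) {
  chunk <- as.data.table(chunk)
  B_toy[chunk, on = .(key.B = key.A, date.B = date.A), roll = "nearest"]
}

# Join everything at once
whole <- nearest_join(A_toy)

# Join in two arbitrary chunks and stack the pieces back together
chunk_id <- rep(1:2, length.out = nrow(A_toy))
chunked <- rbindlist(lapply(unique(chunk_id), function(i) {
  nearest_join(A_toy[chunk_id == i])
}))
setorder(chunked, value.A)

# Same matches either way
identical(chunked$value.B, whole$value.B)
```

How the rows are assigned to chunks does not matter for the result, which is why splitting on the date column in the ffdfdply call works even though the exact match is on the letters.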

The key is simply to use data.table's rolling join (roll = 'nearest') and pass it to ffdfdply, which avoids building the outer merge altogether. It seemed to be quite efficient.
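
For anyone who has not used rolling joins, here is a minimal in-memory sketch of what roll = "nearest" does, on made-up data. The names lookup, queries and time.lookup are hypothetical and only serve this illustration. The join matches x and y exactly and then, within each matching group, rolls to the lookup row whose time is closest to the query's time.

```r
library(data.table)

# Hypothetical lookup table with several candidate times per (x, y) key
lookup <- data.table(
  x    = c(1L, 1L, 1L, 2L, 2L),
  y    = c(1L, 1L, 1L, 1L, 1L),
  time = c(0, 10, 20, 5, 15),
  val  = c("a", "b", "c", "d", "e")
)

# Hypothetical rows to be matched
queries <- data.table(
  x    = c(1L, 1L, 2L),
  y    = c(1L, 1L, 1L),
  time = c(3, 17, 11)
)

# In the result, the join column `time` carries the values from `queries`,
# so keep a copy of the lookup time to see which row was actually matched.
lookup[, time.lookup := time]

# Exact match on x and y, nearest match on the last join column (time)
nearest <- lookup[queries, on = .(x, y, time), roll = "nearest"]
nearest
```

In this toy case the three queries are matched to the lookup rows with times 0, 20 and 15. No intermediate table of all candidate pairs is built, which is what makes the approach attractive compared with the outer merge followed by which.min.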
