
ffbase: merge on columns X and Y and closest column Z

I would like to accomplish the following using ffdf: merge on columns X and Y and on the closest Time, and then merge on the closest column B. However, the procedure that I know for smaller samples involves outer merges (as shown below). Using ffbase, what is a way around this for a large sample that won't fit in memory (and probably wouldn't work with sqldf)? If it is not possible, what would be the best library for this?

A reproducible example, using the same data as the data.table example below:

library(ff)
set.seed(1)
df.ff <- as.ffdf(cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30)))

# ff has no character type, so val is stored as a factor
to.merge.ff <- as.ffdf(data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4), time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5], stringsAsFactors = TRUE))

I borrow the following example from @ChinmayPatil to highlight the similar procedure I would like to follow (R - merge dataframes on matching A, B and *closest* C?):

library(data.table)
set.seed(1)
df <- setDT(cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30)))

to.merge <- setDT(data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4), time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5], stringsAsFactors = FALSE))

## First do a left outer merge
A <- merge(to.merge, df, by = c('x', 'y'), all.x = TRUE)

## Then calculate the absolute time difference for each row
A$diff <- abs(A$time.x - A$time.y)

## Then take the row index with the minimum distance within each (x, y) group
A[, .I[which.min(diff)], by = c('x', 'y')]
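
The last line above returns row indices rather than the rows themselves. As a minimal sketch, assuming the same toy data, the indices can be pulled out of the `V1` column and used to subset `A`. The names `idx` and `closest` are introduced here for illustration only. Note that this keeps one row per (x, y) group, not one row per row of `to.merge`.

```r
library(data.table)

set.seed(1)
df <- as.data.table(cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30)))
to.merge <- data.table(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4),
                       time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5])

# Left outer merge on x and y, then the absolute time difference per row
A <- merge(to.merge, df, by = c("x", "y"), all.x = TRUE)
A[, diff := abs(time.x - time.y)]

# .I holds row numbers within A; V1 holds one row number per (x, y) group
idx <- A[, .I[which.min(diff)], by = .(x, y)]$V1

# Subset A to the closest row of each group (hypothetical name)
closest <- A[idx]
```

This still materialises the full outer merge in memory, which is the step that does not scale to the large sample.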

Given that my question got so few views and no answers, I will describe the approach I came up with to solve this problem, in the hope that someone might find it useful (or that it serves me as a reminder later on):

To me, the most difficult aspect of performing an exact match on one column and then a nearest match on another column was that I kept thinking an outer join (as described in the question) was necessary. The solution is pretty simple using data.table and ffdfdply. For the purpose of illustration, assume there is one large ffdf object and one regular data.table that fits in memory:

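### Large ffdf object
### (column lengths are matched explicitly; ff has no character type, so letters are factors)
dates.A <- seq.Date(as.Date('2008-01-01'), as.Date('2008-01-31'), by = '3 days')
A <- as.ffdf(data.table( dates.A = dates.A,
                     letters.A = factor(rep_len(LETTERS[1:4], length(dates.A))),
                     value.A = runif(length(dates.A)) ))

### Small data.table that fits in memory
date.B <- seq.Date(as.Date('2008-01-01'), as.Date('2008-01-05'), by = 'days')
B <- data.table( date.B = date.B,
                 letters.B = factor(rep_len(LETTERS[1:4], length(date.B))),
                 value.B = runif(length(date.B)) )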

Then you can simply define a function that does the merging using data.table and roll = 'nearest':

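merge.ff <- function(x){
  # x is one in-memory chunk of A, passed in by ffdfdply as a data.frame
  setDT(x)
  # Add join columns with common names; note that B is modified by reference
  x[, ':=' (dates.merge = dates.A, letters.merge = letters.A)]
  B[, ':=' (dates.merge = date.B, letters.merge = letters.B)]
  setkeyv(x, c('letters.merge', 'dates.merge'))
  setkeyv(B, c('letters.merge', 'dates.merge'))

  # Exact match on letters.merge, nearest match on dates.merge (the last key column)
  as.data.frame(B[x, roll = 'nearest'])
}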

Then apply it to A:

result <- ffdfdply( A, split = A$dates.A, FUN = merge.ff)
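
As a rough in-memory analogy of the split-apply-combine step (this is not ffbase code, and it says nothing about how ffdfdply sizes its chunks), the same pattern can be sketched with data.table alone. The tables `big` and `small` and the helper `join_chunk` are hypothetical names made up for this illustration.

```r
library(data.table)

# Hypothetical in-memory stand-ins for the large table and the lookup table
big <- data.table(grp = rep(c("A", "B"), each = 3), time = c(1, 5, 9, 2, 6, 10))
small <- data.table(grp = c("A", "A", "B"), time = c(0, 8, 7), val = c("p", "q", "r"))

# Keep a copy of the lookup time: the join column in the result takes big's values
small[, time.small := time]

# Hypothetical helper: nearest-time join of one chunk against the lookup table
join_chunk <- function(chunk) small[chunk, on = .(grp, time), roll = "nearest"]

# Split into chunks, join each chunk, and stack the pieces back together
chunks <- split(big, by = "grp")
result <- rbindlist(lapply(chunks, join_chunk))
```

Because the whole lookup table is available to every chunk, the way the large table is split does not change which rows get matched.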

The key was essentially to use data.table's rolling join (roll = 'nearest') inside a function and pass that function to ffdfdply, which applies it chunk by chunk. It seemed to be quite efficient.
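
To make the behaviour of roll = "nearest" concrete, here is a small sketch on made-up data, using data.table only. The tables `big` and `small` and their columns are assumptions for illustration, not part of the original problem. The join matches `x` exactly and rolls on `time`, the last column listed in `on`.

```r
library(data.table)

# Hypothetical toy data: big stands in for one chunk of the large table,
# small for the in-memory lookup table
big <- data.table(x = c(1, 1, 2), time = c(2, 9, 5), id = 1:3)
small <- data.table(x = c(1, 1, 2), time = c(1, 10, 20), val = c("a", "b", "c"))

# Keep a copy of the lookup time: the join column in the result takes big's values
small[, time.small := time]

# Exact match on x, nearest match on time; one result row per row of big
matched <- small[big, on = .(x, time), roll = "nearest"]
```

Each row of `big` picks up the `small` row with the same `x` and the closest `time`, so no outer merge or per-group minimum is needed.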
