简体   繁体   中英

Data.table - lapply function within .SD with merge not working

I'm doing some product association work where I have two large data.tables. One is a rules table (2.4m rows) and one is a customer product table (3m rows). Effectively what I want to do is merge the two together and select the top 10 products per customer, but doing this at the high level isn't viable due to the size. However, to get round this, I want to iteratively merge the two tables at a customer level, select the top 10 products and return it.

The below example probably explains it better:

require(data.table)

customer <- data.table(customer=rep(seq(1:5),3),product_bought=rep(c("A","B","C"),5), number=runif(15))[order(customer)]

rules <- data.table(product_bought=c("A","B","C"),recommended_product=c("D","E","F"),number2=runif(3,min=100,max=200))

customer[,lapply(.SD, function(z){
  a <- merge(z,rules,by="product_bought")
  a[,new:=number*number2]
  a[new==max(new)]
  return(a)
}),by=customer]

But I get the following error:

Error in fix.by(by.x, x) : 'by' must specify a uniquely valid colum

What I want it to do for all customers is this:

z <- customer[customer==1]
a <- merge(z,rules,by="product_bought")
a[,new:=number*number2]
a[new==max(new)]

Which gives:

> a[new==max(new)]
   product_bought customer   number recommended_product  number2     new
1:  

        C        1           0.613043                F  168.4335     103.257

I did try using lists, but having a list of 30k data.tables had issues when trying to rbindlist it back up again.

Any ideas why the merge within a .SD doesn't work?

Cheers, Scott

I guess you were trying to do this:

customer[, {
             a <- merge(.SD,rules,by="product_bought");
             a[, new:=number*number2];
             a[new==max(new)]
           }, by = customer]

But it's much better to do a single merge:

customer[rules, on = 'product_bought', new := number * number2]
customer[, .SD[new == max(new)], by = customer]

Or do the .I trick if the last line is too slow.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM