Binning with data.tables

I want to create bins using one table and apply them to another table. I did this:

library(data.table)
library(Hmisc) # for cut2

# (1) Make two data.tables A and B
a <- sample(10:100, 10000, replace=TRUE)
b <- sample(10:90, 10000, replace=TRUE)
A <- data.table(a,b)
a <- sample(0:110, 10000, replace=TRUE)
b <- sample(50:100, 10000, replace=TRUE)
B <- data.table(a,b)

# (2) Create bins using table A (per column)
cc<-A[,lapply(.SD,cut2,g=5, onlycuts=TRUE)]

# (3) Add -Inf and Inf to the cuts (to cope with values in B outside the bins of A)
cc<-rbind(data.table(a=-Inf,b=-Inf),cc,data.table(a=Inf,b=Inf))

# (4) Apply the bins to table B (and table A for inspection)
A[,ac:=as.numeric(cut2(A$a,cuts=cc$a))]
A[,bc:=as.numeric(cut2(A$b,cuts=cc$b))]
B[,ac:=as.numeric(cut2(B$a,cuts=cc$a))]
B[,bc:=as.numeric(cut2(B$b,cuts=cc$b))]

It works, but I want to do step 4 in a proper way, i.e. similar to step 2.

The closest I came was this:

B[,lapply(.SD,cut2,cuts=cc$a),.SDcols=c("a","b")]

But this is not what I want: it uses the bins of only one column (a) for all columns, and it gives the intervals rather than the bin numbers, because I cannot figure out where to place the as.numeric.

Thanks in advance for any pointers.

UPDATE: Thank you mathematical.coffee for the helpful advice. I now have a generic approach:

# (3) Add -Inf and Inf to the cuts (to cope with values in B outside the bins of A)
C<-data.table(c(-Inf,Inf),c(-Inf,Inf))
setnames(C,colnames(cc))
qc<-rbind(C[1],cc,C[2]) # cc holds the cuts from step 2

# (4) Apply the bins to table B 
B[,paste0(colnames(cc),"q"):=mapply(function(x, cuts) as.numeric(cut2(x, cuts)), .SD, qc, SIMPLIFY=FALSE),.SDcols=colnames(qc)]
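Step 3 above builds the padding rows for exactly two columns. As a rough sketch of a version that does not depend on the column count, the cut points can be padded column by column with lapply. Here a plain list (cc_list, a made-up name with made-up cut points) stands in for cc; since a data.table is also a list of columns, the same lapply call should apply to it, though that is not shown here.

```r
# A plain list stands in for cc; the cut points are made up for illustration.
cc_list <- list(a = c(10, 28, 46, 64, 82, 100),
                b = c(10, 26, 42, 58, 74, 90))

# Pad every column's cut points with -Inf and Inf,
# however many columns there are.
padded <- lapply(cc_list, function(v) c(-Inf, v, Inf))

str(padded)
```

The result keeps the column names of the input, so it can be passed to mapply in the same way as qc.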

You can use mapply to match up columns in .SD to those in cc.

B[, mapply(cut2, .SD, cc),.SDcols=c("a","b")]
# or if you wish to assign the result
B[, c('ac', 'bc'):=mapply(cut2, .SD, cc, SIMPLIFY=FALSE),.SDcols=c("a","b")]

This will return the result in interval form, as in "[47, 65)"; if you want the numeric form, then just use

mapply(function(x, cuts) as.numeric(cut2(x, cuts)), .SD, cc)

Note that mapply won't actually match up the names of .SDcols with the names of cc; it just uses the columns in the order in which they appear. You could use .SDcols=names(cc) if you wanted to make sure that they match.
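To illustrate the positional pairing, here is a small base R sketch. Plain lists stand in for .SD and cc, the values and cut points are made up, bin_number is a hypothetical helper, and base cut() is swapped in for cut2 so the example runs without Hmisc.

```r
# Plain lists stand in for .SD and cc; the numbers are made up for illustration.
values <- list(a = c(5, 15, 25), b = c(55, 65, 75))
# Cut points deliberately listed in the opposite order (b first, then a).
cuts <- list(b = c(-Inf, 60, 70, Inf), a = c(-Inf, 10, 20, Inf))

# Hypothetical helper: bin number of each value, intervals closed on the left.
# Base cut() is used here in place of Hmisc::cut2.
bin_number <- function(x, breaks) {
  as.numeric(cut(x, breaks = breaks, right = FALSE))
}

# Positional pairing: column a is binned with the cuts intended for b.
by_position <- mapply(bin_number, values, cuts, SIMPLIFY = FALSE)

# Reordering the cuts by name restores the intended pairing.
by_name <- mapply(bin_number, values, cuts[names(values)], SIMPLIFY = FALSE)

by_position$a  # 1 1 1
by_name$a      # 1 2 3
```

Indexing the cuts by names(values) plays the same role as .SDcols=names(cc): both force the two sides into the same column order before mapply pairs them up.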
