
Merging data.tables based on column names

I am trying to do some left-join merges with data.tables. The package description says:

In all joins the names of the columns are irrelevant; the columns of x's key are joined to in order

I understand that I can use [.data.table and data.table:::merge.data.table.

What I would like is to merge X and Y while specifying the keys (like by.x and by.y in base merge; why was this taken away?).

Let's suppose I have:

library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=3),y=c(1,3,6),v=1:9,key="x,y,v")
DT1 = data.table(x1=c("aa","bb","cc"),y1=c(1,3,6),v1=1:3,key="x1,y1,v1")

and I would like this output:

# data.table's merge method masks the base one; I don't know how to call the base version of merge anymore
R) {base::merge}(DT,DT1,by.x="y",by.y="y1") 
  y x v x1 v1
1 1 a 1 aa  1
2 1 c 7 aa  1
3 1 b 4 aa  1
4 3 a 2 bb  2
5 3 b 5 bb  2
6 3 c 8 bb  2
7 6 b 6 cc  3
8 6 a 3 cc  3
9 6 c 9 cc  3

I am very happy to use [ or data.table:::merge, but I would like an option that does not modify DT or DT1 (such as changing the column names, calling merge, and changing them back).

Update: Since data.table v1.9.6 (released September 19, 2015), merge.data.table() accepts and nicely handles the arguments by.x= and by.y=. Here's an updated link to the FR (now closed) referenced below.
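
As a minimal sketch of what this update describes, assuming data.table 1.9.6 or later is installed, the merge asked for in the question can be written directly with by.x and by.y. The tables repeat the question's sample data, with y spelled out via rep() so that no recycling is relied on and without keys, which this call does not need. The name res is introduced here for illustration only.

```r
library(data.table)

# Sample data from the question (y written out in full, no keys set)
DT  <- data.table(x = rep(c("a", "b", "c"), each = 3),
                  y = rep(c(1, 3, 6), times = 3),
                  v = 1:9)
DT1 <- data.table(x1 = c("aa", "bb", "cc"),
                  y1 = c(1, 3, 6),
                  v1 = 1:3)

# Join DT's column y to DT1's column y1; needs data.table >= 1.9.6
res <- merge(DT, DT1, by.x = "y", by.y = "y1")
print(res)

# Neither input was renamed or otherwise modified
names(DT1)  # unchanged: "x1" "y1" "v1"
```

The join column in the result takes its name from by.x, as in base merge.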


Yes, this is a feature request that is not yet implemented:

FR#2033 Add by.x and by.y to merge.data.table

There isn't anything preventing it. It's just something that wasn't done. I very rarely need merge and was slow to realise its usefulness more generally. We've made good progress in bringing merge performance up to the speed of X[Y], and this feature request is at the highest priority. If you'd like it more quickly, you are more than welcome to add those arguments to merge.data.table and commit the change yourself. We try to keep source code short and together in one function/file, so by looking at the merge.data.table source hopefully you can follow it and see what needs to be done.

The arguments by.x and by.y are now available in the development version of data.table. See here. Use devtools::install_github("Rdatatable/data.table", build_vignettes = FALSE) to install the development version of data.table.

You can't, because the by columns must be in the intersection of colnames(DT) and colnames(DT1):

if (!all(by %in% intersect(colnames(x), colnames(y)))) {
    stop("Elements listed in `by` must be valid column names in x and y")
}
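
To see why this check fails for the question's tables, here is a small sketch that computes the same intersection. It rebuilds the sample data so that it runs on its own; the name common is an illustrative one, not part of data.table.

```r
library(data.table)

# Sample data from the question (y written out in full, no keys set)
DT  <- data.table(x = rep(c("a", "b", "c"), each = 3),
                  y = rep(c(1, 3, 6), times = 3),
                  v = 1:9)
DT1 <- data.table(x1 = c("aa", "bb", "cc"),
                  y1 = c(1, 3, 6),
                  v1 = 1:3)

# The two tables share no column names, so the intersection is empty
common <- intersect(colnames(DT), colnames(DT1))
print(common)

# Hence no value of `by` can pass the check quoted above
"y" %in% common   # FALSE
"y1" %in% common  # FALSE
```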

Here is a workaround using setnames, which does not copy and is very fast:

setnames(DT1,'y1','y')
> merge(DT,DT1)
   y x v x1 v1
1: 1 a 1 aa  1
2: 1 b 4 aa  1
3: 1 c 7 aa  1
4: 3 a 2 bb  2
5: 3 b 5 bb  2
6: 3 c 8 bb  2
7: 6 a 3 cc  3
8: 6 b 6 cc  3
9: 6 c 9 cc  3
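
Note that setnames changes DT1 by reference, so after the call above DT1 keeps the new column name. The following is a sketch of two ways to leave DT1 as it was, assuming the question's sample data: rename a copy() of DT1, or rename back after the merge. The names tmp, resA and resB are illustrative only. Taking a copy gives up the no-copy advantage of setnames, which may matter for large tables.

```r
library(data.table)

# Sample data from the question (y written out in full, no keys set)
DT  <- data.table(x = rep(c("a", "b", "c"), each = 3),
                  y = rep(c(1, 3, 6), times = 3),
                  v = 1:9)
DT1 <- data.table(x1 = c("aa", "bb", "cc"),
                  y1 = c(1, 3, 6),
                  v1 = 1:3)

# Option A: rename on a deep copy, so DT1 itself is never touched
tmp  <- setnames(copy(DT1), "y1", "y")
resA <- merge(DT, tmp, by = "y")

# Option B: rename in place, merge, then restore the original name
setnames(DT1, "y1", "y")
resB <- merge(DT, DT1, by = "y")
setnames(DT1, "y", "y1")

names(DT1)  # back to "x1" "y1" "v1"
```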

EDIT: update for data.table version 1.9.4

You should set the by parameter, otherwise you get an error:

Error in merge.data.table(DT, as.data.table(DT1)) : 
  Elements listed in `by` must be valid column names in x and y

You should do something like:

merge(DT,DT1,by="y")
