Merging data.tables based on column names
I am trying to do some left-join merges with data.tables. The package description states:
In all joins the names of the columns are irrelevant; the columns of x's key are joined to in order
I understand that I can use [.data.table and data.table:::merge.data.table.
What I would like is to merge X and Y while specifying the keys (like by.x and by.y in base merge; why was this taken away?).
Let's suppose I have:
DT = data.table(x=rep(c("a","b","c"),each=3),y=c(1,3,6),v=1:9,key="x,y,v")
DT1 = data.table(x1=c("aa","bb","cc"),y1=c(1,3,6),v1=1:3,key="x1,y1,v1")
and I would like this output:
# merge.data.table masks merge; I don't know how to call the base version of merge anymore
R) {base::merge}(DT,DT1,by.x="y",by.y="y1")
y x v x1 v1
1 1 a 1 aa 1
2 1 c 7 aa 1
3 1 b 4 aa 1
4 3 a 2 bb 2
5 3 b 5 bb 2
6 3 c 8 bb 2
7 6 b 6 cc 3
8 6 a 3 cc 3
9 6 c 9 cc 3
I am very happy to use [ or data.table:::merge, but I would like an option that does not modify DT or DT1 (such as changing the column names, calling merge, and changing them back).
Update: Since data.table v1.9.6 (released September 19, 2015), merge.data.table() accepts and nicely handles the arguments by.x= and by.y=. Here's an updated link to the FR (now closed) referenced below.
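As a minimal sketch of what the update describes, assuming data.table v1.9.6 or later is installed, the merge from the question can be written directly with by.x and by.y. The tables mirror the question's DT and DT1 (keys omitted for brevity), and res is just an illustrative name.

```r
library(data.table)

# Tables modelled on the question's example (no keys set here).
DT  <- data.table(x = rep(c("a", "b", "c"), each = 3),
                  y = rep(c(1, 3, 6), times = 3),
                  v = 1:9)
DT1 <- data.table(x1 = c("aa", "bb", "cc"), y1 = c(1, 3, 6), v1 = 1:3)

# Join DT's column y against DT1's column y1; neither input is renamed.
res <- merge(DT, DT1, by.x = "y", by.y = "y1")
print(res)
```

Neither DT nor DT1 is modified by this call, which is what the question asked for.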
Yes, this is a feature request not yet implemented:
FR#2033 Add by.x and by.y to merge.data.table
There isn't anything preventing it; it's just something that wasn't done. I very rarely need merge and was slow to realise its usefulness more generally. We've made good progress in bringing merge performance up to the speed of X[Y], and this feature request is at the highest priority. If you'd like it more quickly, you are more than welcome to add those arguments to merge.data.table and commit the change yourself. We try to keep source code short and together in one function/file, so by looking at the merge.data.table source hopefully you can follow it and see what needs to be done.
The arguments by.x and by.y are now available in the development version of data.table. See here. Use
devtools::install_github("Rdatatable/data.table", build_vignettes = FALSE)
to install the development version of data.table.
You can't, because the by columns must be in the intersection of colnames(DT) and colnames(DT1):
if (!all(by %in% intersect(colnames(x), colnames(y)))) {
stop("Elements listed in `by` must be valid column names in x and y")
}
Here is a workaround using setnames, which does not copy and is very fast:
setnames(DT1,'y1','y')
> merge(DT,DT1)
y x v x1 v1
1: 1 a 1 aa 1
2: 1 b 4 aa 1
3: 1 c 7 aa 1
4: 3 a 2 bb 2
5: 3 b 5 bb 2
6: 3 c 8 bb 2
7: 6 a 3 cc 3
8: 6 b 6 cc 3
9: 6 c 9 cc 3
You should set the by parameter, otherwise you get an error:
Error in merge.data.table(DT, as.data.table(DT1)) :
Elements listed in `by` must be valid column names in x and y
You should do something like:
merge(DT,DT1,by="y")
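The rename-and-merge workaround discussed above can be sketched end to end as follows. Because setnames changes DT1 by reference, this sketch renames the column back after the merge so that DT1 ends up with its original names; the tables mirror the question's example (keys omitted), and res is an illustrative name.

```r
library(data.table)

DT  <- data.table(x = rep(c("a", "b", "c"), each = 3),
                  y = rep(c(1, 3, 6), times = 3),
                  v = 1:9)
DT1 <- data.table(x1 = c("aa", "bb", "cc"), y1 = c(1, 3, 6), v1 = 1:3)

# Temporarily rename y1 to y (by reference, no copy is made).
setnames(DT1, "y1", "y")

# Merge on the now-shared column, naming it explicitly in `by`.
res <- merge(DT, DT1, by = "y")

# Restore the original column name so DT1 is left as it was.
setnames(DT1, "y", "y1")
print(res)
```

If an error between the two setnames calls is a concern, merging against copy(DT1) instead would leave the original untouched at the cost of one copy.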