How to update both data.tables in a join
Suppose I would like to track which rows from one data.table were merged to another data.table. Is there a way to do this at once/while merging? Please see my example below and the way I usually do it. However, this seems rather inefficient.
library(data.table)
# initial data
DT = data.table(x = c(1,1,1,2,2,1,1,2,2),
y = c(1,3,6))
# data to merge
DTx <- data.table(x = 1:3,
y = 1,
k = "X")
# regular update join
copy(DT)[DTx,
on = .(x, y),
k := i.k][]
#> x y k
#> 1: 1 1 X
#> 2: 1 3 <NA>
#> 3: 1 6 <NA>
#> 4: 2 1 X
#> 5: 2 3 <NA>
#> 6: 1 6 <NA>
#> 7: 1 1 X
#> 8: 2 3 <NA>
#> 9: 2 6 <NA>
# DTx remains the same
DTx
#> x y k
#> 1: 1 1 X
#> 2: 2 1 X
#> 3: 3 1 X
# set an Id variable
DTx[, Id := .I]
# assign the Id in merge
DT[DTx,
on = .(x, y),
`:=`(k = i.k,
matched_id = i.Id)][]
#> x y k matched_id
#> 1: 1 1 X 1
#> 2: 1 3 <NA> NA
#> 3: 1 6 <NA> NA
#> 4: 2 1 X 2
#> 5: 2 3 <NA> NA
#> 6: 1 6 <NA> NA
#> 7: 1 1 X 1
#> 8: 2 3 <NA> NA
#> 9: 2 6 <NA> NA
# use matched_id to find merged rows
DTx[, matched := fifelse(Id %in% DT$matched_id, TRUE, FALSE)]
DTx
#> x y k Id matched
#> 1: 1 1 X 1 TRUE
#> 2: 2 1 X 2 TRUE
#> 3: 3 1 X 3 FALSE
Following Jan's comment:
This will provide you indices of matching rows but you will have to call merge again to perform actual merging, unless you manually use provided indices to match/update those tables.
You can pull the indices:
merge_metaDT = DT[DTx, on=.(x, y), .(irow = .GRP, xrow = .I), by=.EACHI]
x y irow xrow
1: 1 1 1 1
2: 1 1 1 7
3: 2 1 2 4
4: 3 1 3 0
Then apply edits to each table using indices rather than merging or matching a second time:
rowDT = merge_metaDT[xrow != 0L]
DT[rowDT$xrow, k := DTx[rowDT$irow, k]]
DTx[, matched := FALSE][rowDT$irow, matched := TRUE]
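For reference, the steps above can be run end to end from fresh copies of the two tables. This is only a sketch under a few assumptions: `y` is written out with `rep()` instead of relying on recycling, the name `meta` is illustrative, and the filter drops `NA` as well as `0` in case the coding of unmatched rows differs between data.table versions.

```r
library(data.table)

# Fresh copies of the example tables from the question
DT  <- data.table(x = c(1, 1, 1, 2, 2, 1, 1, 2, 2), y = rep(c(1, 3, 6), 3))
DTx <- data.table(x = 1:3, y = 1, k = "X")

# One join: irow is the row of DTx, xrow is the matching row of DT
meta <- DT[DTx, on = .(x, y), .(irow = .GRP, xrow = .I), by = .EACHI]

# Keep matched pairs only (defensive: drop both 0 and NA)
rowDT <- meta[!is.na(xrow) & xrow > 0L]

# Update DT by row index, taking k from the matching row of DTx
DT[rowDT$xrow, k := DTx$k[rowDT$irow]]

# Flag the rows of DTx that found at least one match
DTx[, matched := FALSE][unique(rowDT$irow), matched := TRUE]

print(DT)
print(DTx)
```

Both tables are updated by reference from the single `meta` lookup, so no second join or `%in%` pass is needed.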
How it works:
When joining, x[i], the symbol .I indexes rows of x.
When grouping in a join with by=.EACHI, .GRP indexes each group, which means each row of i here.
We drop non-matching values of .I, which are coded as zeros.
On this last point, we might expect NAs instead of zeros, as returned by DT[DTx, on=.(x, y), which=TRUE]. I'm not sure why these differ.
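To look at the difference directly, the sketch below rebuilds the example tables and compares the row numbers from `which=TRUE` with the `.I` values from the `by=.EACHI` join. The variable names are illustrative, and `nomatch = NULL` assumes a reasonably recent data.table version (older versions use `nomatch = 0L`).

```r
library(data.table)

# Fresh copies of the example tables
DT  <- data.table(x = c(1, 1, 1, 2, 2, 1, 1, 2, 2), y = rep(c(1, 3, 6), 3))
DTx <- data.table(x = 1:3, y = 1, k = "X")

# which = TRUE returns the matching row numbers of DT, one entry per match,
# with NA for a row of DTx that has no match
w_all <- DT[DTx, on = .(x, y), which = TRUE]

# nomatch = NULL drops unmatched rows of DTx instead of reporting NA
w_matched <- DT[DTx, on = .(x, y), which = TRUE, nomatch = NULL]

# .I under by = .EACHI; the answer reports 0 here for the unmatched row
xrow <- DT[DTx, on = .(x, y), .(xrow = .I), by = .EACHI]$xrow

print(w_all)
print(w_matched)
print(xrow)
```

Comparing `w_all` with `xrow` shows how the unmatched third row of `DTx` is coded in each case.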
Suppose I would like to track which rows from one data.table were merged to another data.table. Is there a way to do this at once/while merging? [...] seems rather inefficient.
I expect this is more efficient than multiple merges or %in% when the merge is costly enough.
It still requires multiple steps. I doubt there's any way around that, since it would be hard to come up with logic and syntax for the update that is easy to follow.
Update logic is already complex in base R, with multiple edits on a single index allowed:
> x = c(1, 2, 3)
> x[c(1, 1)] = c(4, 5)
> x
[1] 5 2 3
And there is the question of how to match and edit multiple indices at once:
> x = c(1, 1, 3)
> x[match(c(1, 3), x)] = c(4, 5)
> x
[1] 4 1 5
In data.table updates, the latter issue is handled with mult=. In the update-two-tables use case, these questions would get much more complicated.
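As a rough illustration of what `mult=` controls, the sketch below reuses the example tables and asks only for row numbers with `which=TRUE`, so nothing is modified. The variable names are illustrative.

```r
library(data.table)

# Fresh copies of the example tables
DT  <- data.table(x = c(1, 1, 1, 2, 2, 1, 1, 2, 2), y = rep(c(1, 3, 6), 3))
DTx <- data.table(x = 1:3, y = 1, k = "X")

# DTx row (1, 1) matches rows 1 and 7 of DT; mult picks which one to keep
first_match <- DT[DTx, on = .(x, y), mult = "first", which = TRUE]
last_match  <- DT[DTx, on = .(x, y), mult = "last",  which = TRUE]

print(first_match)  # one entry per row of DTx, NA where there is no match
print(last_match)
```

With the default `mult = "all"`, every matching row of `DT` is returned, which is why the join earlier lists both rows 1 and 7 for the first row of `DTx`.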