How to update both data.tables in a join

Suppose I would like to track which rows from one data.table were merged to another data.table. Is there a way to do this at once/while merging? Please see my example below and the way I usually do it. However, this seems rather inefficient.

Example

library(data.table)

# initial data
DT = data.table(x = c(1,1,1,2,2,1,1,2,2), 
                y = c(1,3,6))

# data to merge
DTx <- data.table(x = 1:3,
                  y = 1,
                  k = "X")

# regular update join
copy(DT)[DTx,
         on = .(x, y),
         k := i.k][]
#>    x y    k
#> 1: 1 1    X
#> 2: 1 3 <NA>
#> 3: 1 6 <NA>
#> 4: 2 1    X
#> 5: 2 3 <NA>
#> 6: 1 6 <NA>
#> 7: 1 1    X
#> 8: 2 3 <NA>
#> 9: 2 6 <NA>

# DTx remains the same
DTx
#>    x y k
#> 1: 1 1 X
#> 2: 2 1 X
#> 3: 3 1 X

What I usually do:

# set an Id variable
DTx[, Id := .I]

# assign the Id in merge
DT[DTx,
   on = .(x, y),
   `:=`(k = i.k,
        matched_id = i.Id)][]
#>    x y    k matched_id
#> 1: 1 1    X          1
#> 2: 1 3 <NA>         NA
#> 3: 1 6 <NA>         NA
#> 4: 2 1    X          2
#> 5: 2 3 <NA>         NA
#> 6: 1 6 <NA>         NA
#> 7: 1 1    X          1
#> 8: 2 3 <NA>         NA
#> 9: 2 6 <NA>         NA

# use matched_id to find merged rows
DTx[, matched := fifelse(Id %in% DT$matched_id, TRUE, FALSE)]
DTx
#>    x y k Id matched
#> 1: 1 1 X  1    TRUE
#> 2: 2 1 X  2    TRUE
#> 3: 3 1 X  3   FALSE

Following Jan's comment:

This will give you the indices of matching rows, but you will have to call merge again to perform the actual merge, unless you manually use the provided indices to match/update those tables.

You can pull the indices:

merge_metaDT = DT[DTx, on=.(x, y), .(irow = .GRP, xrow = .I), by=.EACHI]

   x y irow xrow
1: 1 1    1    1
2: 1 1    1    7
3: 2 1    2    4
4: 3 1    3    0

Then apply edits to each table using the indices, rather than merging or matching a second time:

rowDT = merge_metaDT[xrow != 0L]
DT[rowDT$xrow, k := DTx[rowDT$irow, k]]
DTx[, matched := FALSE][rowDT$irow, matched := TRUE]

How it works:

  • When joining with x[i], the symbol .I indexes rows of x
  • When grouping in a join with by=.EACHI, .GRP indexes each group, which here means each row of i
  • We drop the non-matching values of .I, which are coded as zeros
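The three points above can be wrapped into a small helper. This is only a sketch: update_both, its arguments dt, lookup, keys and col, and the matched column name are hypothetical and not part of data.table. It assumes unmatched rows of i are reported as 0 in .I, as in the output shown earlier, and adds an NA check purely as a guard.

```r
library(data.table)

# Hypothetical helper, not part of data.table: join once, then update both
# tables by reference from the indices produced by that single join.
update_both <- function(dt, lookup, keys, col) {
  # xrow: matching row of dt (.I); irow: row of lookup (.GRP under by = .EACHI)
  idx <- dt[lookup, on = keys, .(irow = .GRP, xrow = .I), by = .EACHI]
  # Unmatched lookup rows appear with xrow == 0; the NA check is only a guard
  idx <- idx[!is.na(xrow) & xrow != 0L]
  # Copy the column into dt at the matched rows
  dt[idx$xrow, (col) := lookup[[col]][idx$irow]]
  # Flag which lookup rows found at least one match
  lookup[, matched := FALSE][idx$irow, matched := TRUE]
  invisible(idx)
}

DT  <- data.table(x = c(1, 1, 1, 2, 2, 1, 1, 2, 2), y = c(1, 3, 6))
DTx <- data.table(x = 1:3, y = 1, k = "X")
update_both(DT, DTx, keys = c("x", "y"), col = "k")
```

After the call, DT carries the new k column and DTx carries the matched flag, both set by reference from one join.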

On this last point, we might expect NAs instead of zeros, as returned by DT[DTx, on=.(x, y), which=TRUE]. I'm not sure why these differ.
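For comparison, here is a minimal sketch of what which= returns on the same example data. The variable names w, w_matched and i_unmatched are only illustrative.

```r
library(data.table)

DT  <- data.table(x = c(1, 1, 1, 2, 2, 1, 1, 2, 2), y = c(1, 3, 6))
DTx <- data.table(x = 1:3, y = 1, k = "X")

# Rows of DT matched by each row of DTx; an unmatched row of DTx yields NA
w <- DT[DTx, on = .(x, y), which = TRUE]

# nomatch = NULL drops unmatched rows of DTx instead of returning NA
w_matched <- DT[DTx, on = .(x, y), which = TRUE, nomatch = NULL]

# which = NA returns the rows of DTx that have no match in DT
i_unmatched <- DT[DTx, on = .(x, y), which = NA]
```

Note that which=TRUE alone does not say which row of DTx produced each match when a row matches several times, which is why the by=.EACHI approach above carries .GRP alongside .I.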


Suppose I would like to track which rows from one data.table were merged to another data.table. Is there a way to do this at once/while merging? [...] seems rather inefficient.

I expect this to be more efficient than running multiple merges or an %in% lookup when the merge is costly enough.

It still requires multiple steps. I doubt there's any way around that, since it would be hard to come up with logic and syntax for the update that is easy to follow.

Update logic is already complex in base R, with multiple edits on a single index allowed:

> x = c(1, 2, 3)
> x[c(1, 1)] = c(4, 5)
> x
[1] 5 2 3

And there is the question of how to match and edit multiple indices at once:

> x = c(1, 1, 3)
> x[match(c(1, 3), x)] = c(4, 5)
> x
[1] 4 1 5

In data.table updates, the latter issue is handled with mult=. In the update-two-tables use case, these questions would get much more complicated.
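As a small illustration of what mult= selects, the sketch below uses which=TRUE on the example data rather than an update, so the chosen rows are visible directly. The names w_all, w_first and w_last are only illustrative.

```r
library(data.table)

DT  <- data.table(x = c(1, 1, 1, 2, 2, 1, 1, 2, 2), y = c(1, 3, 6))
DTx <- data.table(x = 1:3, y = 1, k = "X")

# Default mult = "all": every matching row of DT for each row of DTx
w_all   <- DT[DTx, on = .(x, y), which = TRUE]

# mult = "first" / "last": one row of DT per row of DTx, NA when unmatched
w_first <- DT[DTx, on = .(x, y), which = TRUE, mult = "first"]
w_last  <- DT[DTx, on = .(x, y), which = TRUE, mult = "last"]
```

Here the row x = 1, y = 1 of DTx matches rows 1 and 7 of DT, so "first" and "last" pick different rows for it, while the single match for x = 2 is the same either way.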
