Looping over a data.table in R

Question

I have a dataset of roughly 3 million rows. I created a small example that looks like this:

ex <- data.table(eoc = c(1,1,1,1,1,2,2,2,3,3), proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236), trigger_cpt = c(63020,63020,63020,63020,63020,63020,63020,63020,61112,61112))

I have another dataset of 42 rows, but made a smaller example of it:

add_on <- data.table(primary = c(63020,61112), secondary=c(63035,63445))

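I need to relabel certain rows of the trigger_cpt column (grouped by eoc) if the trigger_cpt value happens to be one of the values in the primary column of the add_on dataset and a proc1 value is the corresponding secondary value in add_on. If the criteria are met, trigger_cpt should be relabeled to the secondary code.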

I first entered everything manually:

ex[,trigger_new:= if(any(trigger_cpt == '63020' & proc1 == '63035')) 63035 else trigger_cpt, eoc]

Then I decided to write a for loop:

for(i in 1:nrow(add_on)){
  ex[,trigger_new2 := if(any(trigger_cpt == add_on[i,1] & proc1 == add_on[i,2])) add_on[i,2] else trigger_cpt, eoc]
}

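However, now that I am trying this code on my 3-million-row dataset, it takes a very long time to run. I am not sure whether there is a better approach, or whether my current code can be modified.

Any help would be greatly appreciated!

Expected output: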

ex_final <- data.table(eoc = c(1,1,1,1,1,2,2,2,3,3), proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236), trigger_cpt = c(63035,63035,63035,63035,63035,63020,63020,63020,61112,61112))

Answer 1

Here is an approach that produces a data.table in which every trigger_cpt in a group is set to the secondary value if a match is found in that group:


ex2 <- add_on[ex, , on=.(primary=trigger_cpt)][ , trigger_new := fifelse( secondary %in% proc1, secondary, NA_real_ ), by=eoc ]
ex.final  <- ex2[ , trigger_cpt := fcoalesce( trigger_new, primary ) ][, .(eoc,proc1,trigger_cpt) ]

Output:


> ex.final
    eoc proc1 trigger_cpt
 1:   1 63035       63035
 2:   1 63020       63035
 3:   1 92344       63035
 4:   1 63035       63035
 5:   1 27567       63035
 6:   2 63020       63020
 7:   2  1234       63020
 8:   2 55678       63020
 9:   3 61112       61112
10:   3  1236       61112
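As a hedged alternative sketch (not part of the answer above), the same relabeling can be written as an inner join on both columns followed by an update join. This may be useful if the real add_on table lists several secondary codes for one primary code, since the left join above would then duplicate rows of ex. The names hits and trigger_new are illustrative, and "first match wins" is an assumption about how ties should be resolved.

```r
library(data.table)

ex <- data.table(
  eoc = c(1,1,1,1,1,2,2,2,3,3),
  proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236),
  trigger_cpt = c(63020,63020,63020,63020,63020,63020,63020,63020,61112,61112)
)
add_on <- data.table(primary = c(63020,61112), secondary = c(63035,63445))

# Inner join on both columns: keep only the rows of ex whose
# (trigger_cpt, proc1) pair appears in add_on as (primary, secondary).
hits <- ex[add_on,
           on = .(trigger_cpt = primary, proc1 = secondary),
           nomatch = NULL,
           .(eoc, trigger_new = i.secondary)]

# One replacement code per group (assumption: the first match wins).
hits <- unique(hits, by = "eoc")

# Update join: overwrite trigger_cpt by reference, for matched groups only.
ex[hits, on = "eoc", trigger_cpt := i.trigger_new]

ex
```

Because the update happens by reference, no intermediate copy of the 3-million-row table is created; groups without a match keep their original trigger_cpt.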

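In addition, I would consider using setkey if it is feasible (it comes at a cost), unless it does more harm than good: the initial sort may make it not worthwhile. It speeds up downstream operations, and it can make join code cleaner. data.table code can be hard enough to read as it is. So: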


setkey(ex, trigger_cpt )
setkey(add_on, primary )

## can now do this:
add_on[ex]

## instead of this:
add_on[ex, , on=.(primary=trigger_cpt)]

## .. in the code above.
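A small sketch of what setting keys does, using the sample tables from the question. It works on copies (the names ex_k and add_on_k are illustrative) because setkey() sorts its argument by reference, so the row order of the table changes.

```r
library(data.table)

ex <- data.table(
  eoc = c(1,1,1,1,1,2,2,2,3,3),
  proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236),
  trigger_cpt = c(63020,63020,63020,63020,63020,63020,63020,63020,61112,61112)
)
add_on <- data.table(primary = c(63020,61112), secondary = c(63035,63445))

# Work on copies, because setkey() reorders the table in place.
ex_k <- copy(ex)
add_on_k <- copy(add_on)
setkey(ex_k, trigger_cpt)
setkey(add_on_k, primary)

key(ex_k)    # the key column that keyed joins will use
ex_k$eoc     # rows are now ordered by trigger_cpt, not by eoc

# The keyed join and the explicit on= join pair up the same rows.
keyed   <- add_on_k[ex_k]
unkeyed <- add_on[ex_k, on = .(primary = trigger_cpt)]
identical(keyed$secondary, unkeyed$secondary)
```

If later steps depend on the original row order, it may be safer to stay with the on= form or to carry an explicit row index column.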

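... Furthermore ...

If you are stepping through the code above, you will notice that add_on[ex] (a somewhat backwards way of doing a left join in data.table) leaves you with the key column names of add_on rather than those of ex. That does not matter as long as you are aware of it and rename the columns appropriately at the end, but another way to join the data could be this: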


## all.x=TRUE keeps every row of ex (a left join); merge() defaults to an inner join
ex2 <- merge( ex, add_on, by.x="trigger_cpt", by.y="primary", all.x=TRUE )
## and then work your way till the end with what this gives you
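To make the difference between the two join forms concrete, here is a minimal sketch. It adds one hypothetical row to the sample data (eoc 4, trigger_cpt 99999, which has no entry in add_on) so that the effect of all.x becomes visible; that row and the names ex_extra, joined, inner, and left are illustrative only.

```r
library(data.table)

ex <- data.table(
  eoc = c(1,1,1,1,1,2,2,2,3,3),
  proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236),
  trigger_cpt = c(63020,63020,63020,63020,63020,63020,63020,63020,61112,61112)
)
add_on <- data.table(primary = c(63020,61112), secondary = c(63035,63445))

# Hypothetical extra row whose trigger_cpt has no match in add_on.
ex_extra <- rbind(ex, data.table(eoc = 4, proc1 = 1, trigger_cpt = 99999))

# add_on[ex] form: the join column carries add_on's name, "primary".
joined <- add_on[ex_extra, on = .(primary = trigger_cpt)]
names(joined)

# merge() form: the join column keeps ex's name, "trigger_cpt".
inner <- merge(ex_extra, add_on, by.x = "trigger_cpt", by.y = "primary")
left  <- merge(ex_extra, add_on, by.x = "trigger_cpt", by.y = "primary",
               all.x = TRUE)
names(left)

nrow(inner)  # the unmatched row is dropped by the default inner join
nrow(left)   # all.x = TRUE keeps every row of ex_extra

# Renaming after the add_on[ex] form, if that form is preferred.
setnames(joined, "primary", "trigger_cpt")
```

Note that merge() also sorts the result by the join columns by default, so the row order may differ from that of ex.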

Answer 2

Based on the expected output:

ex[, trigger_new := first(proc1), eoc]




ex
    eoc proc1 trigger_cpt trigger_new
 1:   1 63035       63020       63035
 2:   1 63020       63020       63035
 3:   1 92344       63020       63035
 4:   1 63035       63020       63035
 5:   1 27567       63020       63035
 6:   2 63020       63020       63020
 7:   2  1234       63020       63020
 8:   2 55678       63020       63020
 9:   3 61112       61112       61112
10:   3  1236       61112       61112
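One caveat may be worth checking before applying this to the full dataset: first(proc1) depends only on the row order within each eoc group and never consults add_on. The sketch below reverses the sample rows (an illustrative reshuffle, with ex_rev as a made-up name) to show how the result can change.

```r
library(data.table)

ex <- data.table(
  eoc = c(1,1,1,1,1,2,2,2,3,3),
  proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236),
  trigger_cpt = c(63020,63020,63020,63020,63020,63020,63020,63020,61112,61112)
)

# Same data with the rows in reverse order.
ex_rev <- ex[.N:1]
ex_rev[, trigger_new := first(proc1), by = eoc]

# Group 1 now starts with proc1 == 27567, so that value is used
# instead of the expected 63035.
ex_rev[eoc == 1, unique(trigger_new)]
```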

Solution 1
2, accepted, 2021-03-29 23:06:33

Solution 2
1, 2021-03-29 22:28:23
