
Add column that sums all previous rows that meet condition

Say I have a large table with the following columns:

   subject stim1 stim2 Chosen
1:       1     2     1      2
2:       1     3     2      2
3:       1     3     1      1
4:       1     2     3      3
5:       1     1     3      1

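I am looking for an efficient way (the full dataset is large) to add two more sets of columns, computed per subject:

  1. stim1_seen, stim2_seen = the number of previous rows in which the current stim1 appeared in either stim1 or stim2 (stim1_seen), or in which the current stim2 appeared in either stim1 or stim2 (stim2_seen).
  2. stim1_chosen, stim2_chosen = the number of previous rows in which the current stim1 and the current stim2, respectively, were the Chosen value.

Desired output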

     subject stim1 stim2 Chosen  stim1_chosen   stim2_chosen
1:       1     2     1      2         0               0
2:       1     3     2      2         0               1
3:       1     3     1      1         0               0
4:       1     2     3      3         2               0
5:       1     1     3      1         1               1
6:       1     2     1      1         2               2

Ideally it would use data.table or dplyr.

Here is the input

structure(list(subject = c(1021, 1021, 1021, 1021, 1021, 1021
), stim1 = c(51L, 48L, 49L, 48L, 49L, 46L), stim2 = c(50L, 50L, 
47L, 46L, 51L, 47L), Chosen = c(50L, 50L, 49L, 48L, 49L, 46L)), row.names = c(NA, 
-6L), class = c("data.table", "data.frame"))

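OK, this works on the example data. It would be good to run it on data with more subjects and with values greater than 1 in the columns. I am assuming it is a data.table object called dt.

1. Indexing

Merge operations can easily change the row order, so rather than relying on row numbers, create a rowid within each subject. .N is the data.table symbol for the number of rows in the current group.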

# order matters, so make a rowid
dt[, rowid := 1:.N, by=subject]

# sets orders and indexing to make it quicker
setkey(dt, subject, rowid)
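
As a minimal sketch of the per-subject index, the snippet below uses a hypothetical two-subject table (toy, not from the question) and seq_len(.N) in place of 1:.N; the two are equivalent for non-empty groups.

```r
library(data.table)

# Hypothetical two-subject table, made up for illustration
toy <- data.table(subject = c(1L, 1L, 2L, 2L, 2L),
                  stim1   = c(2L, 3L, 5L, 4L, 5L))

# seq_len(.N) restarts at 1 inside every subject group
toy[, rowid := seq_len(.N), by = subject]

# keying sorts by subject then rowid and speeds up later merges
setkey(toy, subject, rowid)
toy[]
```

Each subject gets its own 1..n sequence, so the original within-subject order can be restored after any merge.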

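2. Seen cols

stim1 and stim2 need to be combined into one column, which is done by going from wide to long format with melt. Then group by the stimulus value to find its cumulative occurrence by row. Because we are counting prior occurrences only, we subtract 1, which is why the counter is seen := 0:(.N-1).

We then merge twice, because the result has to be matched against both stim cols.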

# for seen, melt wide to long
dt_seen <- melt(dt, 
                id.vars = c("subject", "rowid"), 
                measure.vars = c("stim1", "stim2"))
# interested in finding occurrences (a value shown in both cols of a row counts once)
dt_seen <- unique(dt_seen[, .(subject, rowid, value)])
setorder(dt_seen, rowid)
dt_seen[, seen:=0:(.N-1), by=.(subject, value)]

# merge across twice
dt <- merge(dt, dt_seen, 
            by.x=c("subject", "rowid", "stim1"), 
            by.y=c("subject", "rowid", "value"), 
            all.x=TRUE, sort=FALSE)
setnames(dt, "seen", "stim1_seen")
dt <- merge(dt, dt_seen, 
            by.x=c("subject", "rowid", "stim2"), 
            by.y=c("subject", "rowid", "value"), 
            all.x=TRUE, sort=FALSE)
setnames(dt, "seen", "stim2_seen")
dt[]
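
To see what the long table looks like before the merges, here is a small sketch that runs the same melt, unique and counter steps on the stimulus values of the question's first example. The names toy and long are made up for illustration, and seq_len(.N) - 1L stands in for 0:(.N-1).

```r
library(data.table)

# stimulus values from the question's first example, with an explicit rowid
toy <- data.table(subject = 1L, rowid = 1:6,
                  stim1 = c(2L, 3L, 3L, 2L, 1L, 2L),
                  stim2 = c(1L, 2L, 1L, 3L, 3L, 1L))

long <- melt(toy, id.vars = c("subject", "rowid"),
             measure.vars = c("stim1", "stim2"))

# one row per (subject, rowid, value)
long <- unique(long[, .(subject, rowid, value)])
setorder(long, subject, rowid)

# number of earlier rows in which this value was shown
long[, seen := seq_len(.N) - 1L, by = .(subject, value)]

# value 2 is shown in rows 1, 2, 4 and 6, so seen is 0, 1, 2, 3
long[value == 2L]
```

Merging long back onto the wide table by stim1 and then by stim2 picks out the counter for whichever value sits in each column on that row.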

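3. Chosen

I have been lazy and effectively done the same as in section (2), but restricted to the rows where Chosen matches the stim value. The two columns are handled one at a time rather than together, because they are independent. The process is identical for stim1 and stim2, so this could be tidied up a bit.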

# turn Chosen from wide to long
dt_chosen <- melt(dt,
                  id.vars = c("subject", "rowid"), 
                  measure.vars = c("Chosen"))
# interested in finding occurrences
# need to expand 
dt_chosen[, variable := NULL]
# going to expand the grid, so can look at e.g. value 50 for all rowids
library(tidyr)
dt_chosen[, chosen_row := 1]
dt_chosen_full <- expand(dt_chosen, nesting(subject, rowid), value) %>% setDT
# pull in the actual data and fill rest with 0's
dt_chosen_full <- merge(dt_chosen_full, dt_chosen, by=c("subject", "rowid", "value"),
                        all.x=TRUE)
dt_chosen_full[is.na(chosen_row), chosen_row := 0]
# use cumsum to identify now the cumulative count of these across the full row set
dt_chosen_full[, chosen := cumsum(chosen_row), by=.(subject, value)]
# we want prior counts, so on the chosen row itself subtract one (the update happens after the row)
dt_chosen_full[chosen_row==1, chosen := chosen-1]

# merge across twice
dt <- merge(dt, dt_chosen_full[, -"chosen_row"], 
            by.x=c("subject", "rowid", "stim1"), 
            by.y=c("subject", "rowid", "value"), 
            all.x=TRUE, sort=FALSE)
setnames(dt, "chosen", "stim1_chosen")
dt[is.na(stim1_chosen), stim1_chosen := 0]

dt <- merge(dt, dt_chosen_full[, -"chosen_row"], 
            by.x=c("subject", "rowid", "stim2"), 
            by.y=c("subject", "rowid", "value"), 
            all.x=TRUE, sort=FALSE)
setnames(dt, "chosen", "stim2_chosen")
dt[is.na(stim2_chosen), stim2_chosen := 0]
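
The "cumsum, then subtract one on the row itself" step can be shown in isolation. This is only a sketch on a made-up 0/1 indicator (toy and hit are hypothetical names); subtracting the indicator from its own cumulative sum gives the same result as subtracting 1 only where the indicator is 1.

```r
library(data.table)

# hit = 1 on rows where some fixed value was chosen (made-up indicator)
toy <- data.table(rowid = 1:5, hit = c(1L, 0L, 1L, 1L, 0L))

# cumsum(hit) counts hits up to and including this row;
# removing the row's own hit leaves the hits strictly before it
toy[, prior_hits := cumsum(hit) - hit]
toy[]
```

Here prior_hits comes out as 0, 1, 1, 2, 3, which is the count of earlier hits on every row.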

Output

dt[]
   subject rowid stim2 stim1 Chosen stim1_seen stim2_seen stim1_chosen stim2_chosen
1:    1021     1    50    51     50          0          0            0            0
2:    1021     2    50    48     50          0          1            0            1
3:    1021     3    47    49     49          0          0            0            0
4:    1021     4    46    48     48          1          0            0            0
5:    1021     5    51    49     49          1          1            1            0
6:    1021     6    47    46     46          1          1            0            0

Here is a single pipe, demonstrated on both frames.

dat1 is the data for which you show some expected output

dat1[, c("stim1_seen", "stim2_seen") :=
         lapply(.SD, function(z) mapply(function(x, S) {
           sum(stim1[S] %in% x | stim2[S] %in% x)
         }, z, lapply(seq_len(.N)-1, seq_len))),
     .SDcols = c("stim1", "stim2"), by = .(subject)
     ][, c("stim1_chosen", "stim2_chosen") :=
           lapply(.SD, function(z) mapply(function(x, S) {
             sum(Chosen[S] %in% x)
           }, z, lapply(seq_len(.N)-1, seq_len))),
       .SDcols = c("stim1", "stim2"), by = .(subject)]
#    subject stim1 stim2 Chosen stim1_seen stim2_seen stim1_chosen stim2_chosen
#      <int> <int> <int>  <int>      <int>      <int>        <int>        <int>
# 1:       1     2     1      2          0          0            0            0
# 2:       1     3     2      2          0          1            0            1
# 3:       1     3     1      1          1          1            0            0
# 4:       1     2     3      3          2          2            2            0
# 5:       1     1     3      1          2          3            1            1
# 6:       1     2     1      1          3          3            2            2

dat2 is the input you posted (different data)

dat2[, c("stim1_seen", "stim2_seen") :=
         lapply(.SD, function(z) mapply(function(x, S) {
           sum(stim1[S] %in% x | stim2[S] %in% x)
         }, z, lapply(seq_len(.N)-1, seq_len))),
     .SDcols = c("stim1", "stim2"), by = .(subject)
     ][, c("stim1_chosen", "stim2_chosen") :=
           lapply(.SD, function(z) mapply(function(x, S) {
             sum(Chosen[S] %in% x)
           }, z, lapply(seq_len(.N)-1, seq_len))),
       .SDcols = c("stim1", "stim2"), by = .(subject)]
#    subject stim1 stim2 Chosen stim1_seen stim2_seen stim1_chosen stim2_chosen
#      <num> <int> <int>  <int>      <int>      <int>        <int>        <int>
# 1:    1021    51    50     50          0          0            0            0
# 2:    1021    48    50     50          0          1            0            1
# 3:    1021    49    47     49          0          0            0            0
# 4:    1021    48    46     48          1          0            0            0
# 5:    1021    49    51     49          1          1            1            0
# 6:    1021    46    47     46          1          1            0            0

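The effort here is in doing a "cumulative %in%". Effectively, that is what the mapply is doing.

  • Knowing that data.table's .N special symbol gives the number of rows in the group, this is useful: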

     lapply(seq_len(.N)-1, seq_len)
     # [[1]]
     # integer(0)
     # [[2]]
     # [1] 1
     # [[3]]
     # [1] 1 2
     # [[4]]
     # [1] 1 2 3
     # [[5]]
     # [1] 1 2 3 4
     # [[6]]
     # [1] 1 2 3 4 5

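    This is used to index all rows before each row; that is, in row 1 there are no preceding rows, so we index on integer(0); in row 5 we index on 1 2 3 4; and so on.

  • We "zip" these together (using mapply) with each stim1 value (and then each stim2 value), look for its presence in the original stim1 and stim2 columns indexed on S (from the previous bullet), and sum the occurrences.

  • Finally, we do this for both stim* columns by iterating over .SD (using .SDcols).

  • The process is repeated (though more simply) for the Chosen column.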
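
The bullets above can be followed step by step outside data.table. The sketch below uses plain vectors holding the dat1 values for a single subject; prior is a made-up name for the list of preceding-row indices, and seq_along(stim1) plays the role of seq_len(.N).

```r
# dat1 values for one subject, as plain vectors
stim1  <- c(2L, 3L, 3L, 2L, 1L, 2L)
stim2  <- c(1L, 2L, 1L, 3L, 3L, 1L)
Chosen <- c(2L, 2L, 1L, 3L, 1L, 1L)

# element i holds the indices of the rows before row i (integer(0) for row 1)
prior <- lapply(seq_along(stim1) - 1L, seq_len)

# "cumulative %in%": count earlier rows that showed x in either column
stim1_seen <- mapply(function(x, S) sum(stim1[S] %in% x | stim2[S] %in% x),
                     stim1, prior)

# same idea against Chosen: count earlier rows in which x was chosen
stim1_chosen <- mapply(function(x, S) sum(Chosen[S] %in% x), stim1, prior)

stim1_seen    # 0 0 1 2 2 3
stim1_chosen  # 0 0 0 2 1 2
```

Both vectors agree with the stim1_seen and stim1_chosen columns printed for dat1. Each row rescans all earlier rows, so the work grows quadratically with the number of rows per subject, which may matter on a large table.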


Data

dat1 <- setDT(structure(list(subject = c(1L, 1L, 1L, 1L, 1L, 1L), stim1 = c(2L, 3L, 3L, 2L, 1L, 2L), stim2 = c(1L, 2L, 1L, 3L, 3L, 1L), Chosen = c(2L, 2L, 1L, 3L, 1L, 1L)), class = c("data.table", "data.frame"), row.names = c(NA, -6L)))
dat2 <- setDT(structure(list(subject = c(1021, 1021, 1021, 1021, 1021, 1021), stim1 = c(51L, 48L, 49L, 48L, 49L, 46L), stim2 = c(50L, 50L, 47L, 46L, 51L, 47L), Chosen = c(50L, 50L, 49L, 48L, 49L, 46L)), row.names = c(NA, -6L), class = c("data.table", "data.frame")))
