
data.table remove rows based on lag value by group

I have a data.table in the following form:

library(data.table)

DT <- data.table(tag = rep(c("A", "B"), each = 10),
                 value =  c(0, 3, 3, 3, 0, 1, 1, 1, 3, 0,
                            0, 1, 3, 1, 0, 3, 0, 1, 1, 0))
> DT
    tag value
 1:   A     0
 2:   A     3
 3:   A     3
 4:   A     3
 5:   A     0
 6:   A     1
 7:   A     1
 8:   A     1
 9:   A     3
10:   A     0
11:   B     0
12:   B     1
13:   B     3
14:   B     1
15:   B     0
16:   B     3
17:   B     0
18:   B     1
19:   B     1
20:   B     0

I would like to remove all rows that have a value of 3, but only those that follow a 0 (either directly or as part of the same run of 3s). That is, I would like to remove rows 2, 3, 4 and row 16, but need to keep row 9 and row 13.

Is there a way to do this?

A possible solution:

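# threes: run id over "is the value 3?"; apz: TRUE for a 3 directly preceded by a 0
# a run is kept only if none of its rows is flagged in apz
DT[, `:=` (threes = rleid(value == 3), apz = value == 3 & shift(value) == 0)
   ][, if (all(!apz)) .SD, by = threes
     ][, c('threes', 'apz') := NULL]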

which gives:

    tag value
 1:   A     0
 2:   A     0
 3:   A     1
 4:   A     1
 5:   A     1
 6:   A     3
 7:   A     0
 8:   B     0
 9:   B     1
10:   B     3
11:   B     1
12:   B     0
13:   B     0
14:   B     1
15:   B     1
16:   B     0
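
Another option, grouping by tag:

# step 1: previous row's value within each tag
# step 2: give every row of a run the value that preceded the run
# step 3: drop the 3s whose run was preceded by a 0
DT[, prev.value := shift(value), by = tag][
   , prev.value := prev.value[1], by = .(tag, rleid(value))][
   !(value == 3 & prev.value == 0)]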
#    tag value prev.value
# 1:   A     0         NA
# 2:   A     0          3
# 3:   A     1          0
# 4:   A     1          0
# 5:   A     1          0
# 6:   A     3          1
# 7:   A     0          3
# 8:   B     0         NA
# 9:   B     1          0
#10:   B     3          1
#11:   B     1          3
#12:   B     0          1
#13:   B     0          3
#14:   B     1          0
#15:   B     1          0
#16:   B     0          1
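
To see what the two := steps in this approach do, here is a minimal sketch on the tag "A" values alone. The table name A and the column names prev and run_prev are made up for illustration; they are not part of the answer above.

```r
library(data.table)

# Values of tag "A" from the question
A <- data.table(value = c(0, 3, 3, 3, 0, 1, 1, 1, 3, 0))

# Step 1: value of the previous row (NA for the first row)
A[, prev := shift(value)]

# Step 2: within each run of equal values, copy the first row's prev
# to the whole run, so every row knows what preceded its run
A[, run_prev := prev[1], by = rleid(value)]

# Step 3: drop the 3s whose run was preceded by a 0
kept <- A[!(value == 3 & run_prev == 0)]
kept$value  # 0 0 1 1 1 3 0
```

Note that a group starting with 3 would get run_prev = NA, and data.table drops rows where the filter evaluates to NA; the sample data never hits that case.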

Here's a one-liner of sorts (props to @Procrastinatus for the improvement):

DT[setDT(rle(value))[, rep(!( values==3 & shift(values)==0 ), lengths)] ]

To understand how it works, try running DT[, setDT(rle(value))], which shows how R summarizes runs of consecutive values, and read ?rle.
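
As a rough illustration of the run-length logic, the sketch below applies the same idea step by step to the tag "A" values only. The intermediate names r, keep_run and keep_row are introduced here for readability and do not appear in the one-liner.

```r
library(data.table)  # only needed for shift()

# Values of tag "A" from the question
value <- c(0, 3, 3, 3, 0, 1, 1, 1, 3, 0)

# rle() collapses consecutive equal values into runs
r <- rle(value)
r$values   # 0 3 0 1 3 0
r$lengths  # 1 3 1 3 1 1

# One flag per run: keep it unless it is a run of 3s preceded by a run of 0s.
# The first run has no predecessor (NA), but FALSE & NA is FALSE, so it is kept.
keep_run <- !(r$values == 3 & shift(r$values) == 0)
keep_run   # TRUE FALSE TRUE TRUE TRUE TRUE

# Expand the per-run flags back to one flag per row
keep_row <- rep(keep_run, r$lengths)
value[keep_row]  # 0 0 1 1 1 3 0
```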


My original approach was:

DT[ rleid(value) %in% setDT(rle(value))[ , .I[!( values==3 & shift(values)==0 )]] ]

Try DT[, rleid(value)] and read ?rleid for details. This second approach is worse because the runs are evaluated twice (using both rle and rleid).
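
Both expressions above compute runs over the whole value column and do not look at tag. With the sample data this makes no difference, because each tag starts with 0. If runs should never carry over from one tag to the next, one possible variant (a sketch, not part of the original answer; the names keep, res, prev and drop_run are made up here) computes the mask per tag and collects row numbers with .I:

```r
library(data.table)

DT <- data.table(tag = rep(c("A", "B"), each = 10),
                 value = c(0, 3, 3, 3, 0, 1, 1, 1, 3, 0,
                           0, 1, 3, 1, 0, 3, 0, 1, 1, 0))

# For each tag, find the runs to drop and return the row numbers to keep
keep <- DT[, {
  r <- rle(value)
  prev <- shift(r$values)                      # value of the preceding run, NA for the first
  drop_run <- r$values == 3 & !is.na(prev) & prev == 0
  .I[rep(!drop_run, r$lengths)]                # .I holds the group's row numbers in DT
}, by = tag]$V1

res <- DT[keep]
nrow(res)  # 16
```

The explicit !is.na(prev) check means a tag that begins with 3 keeps those rows, since nothing precedes them within the tag. Whether that is the desired behavior depends on the use case.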
