How to remove 3 consecutive rows containing the same value in a data.table

Question

I have a data.table in R with 3 columns, as follows:

library(data.table)

DT_A <- data.table(
  sid = c(1,1,2,2,2,3,3,2,3,3),
  date = as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
                   "2014-06-23","2014-06-24","2014-06-25","2014-06-26")),
  Status1 = c("A","B","A","A","B","A","A","A","B","B"))

The data looks like this:

    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       A
 9:   3 2014-06-25       B
10:   3 2014-06-26       B

How can I check Status1 for 3 consecutive rows that all have the value A (like rows 6, 7, and 8) and then delete those rows?

Answer 1

The question is tagged data.table, so I'll try to give an appropriate answer:

DT_A[!DT_A[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]

   sid       date Status1
1:   1 2014-06-22       A
2:   1 2014-06-23       B
3:   2 2014-06-22       A
4:   2 2014-06-23       A
5:   2 2014-06-24       B
6:   3 2014-06-25       B
7:   3 2014-06-26       B
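To see why this one-liner works, it may help to look at its intermediate pieces. The sketch below assumes the data.table package is installed and rebuilds the sample data from the question; the variable names run_id, drop_idx, and result are only illustrative.

```r
library(data.table)

DT_A <- data.table(
  sid = c(1, 1, 2, 2, 2, 3, 3, 2, 3, 3),
  date = as.Date(c("2014-06-22", "2014-06-23", "2014-06-22", "2014-06-23",
                   "2014-06-24", "2014-06-22", "2014-06-23", "2014-06-24",
                   "2014-06-25", "2014-06-26")),
  Status1 = c("A", "B", "A", "A", "B", "A", "A", "A", "B", "B"))

# rleid() gives every run of identical consecutive values its own id
run_id <- rleid(DT_A$Status1)
run_id
# 1 2 3 3 4 5 5 5 6 6

# Within each run, .N is the run length and .I holds the original row numbers;
# only runs of exactly 3 rows with Status1 == "A" return their row numbers
drop_idx <- DT_A[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1
drop_idx
# 6 7 8

# Negating the index keeps every other row
result <- DT_A[!drop_idx]
```

Because the condition is .N == 3, a run of four or more A rows is left untouched; using .N >= 3 instead would also remove longer runs.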

Other test cases

As pointed out by Frank, my first answer (now edited) worked only for the sample data set provided by the OP and failed for other test cases.

So, the edited code is applied to some other test cases below.

Case B: Runs of 3 consecutive rows for both A and B

DT_B <- data.table(
  sid=c(1,1,2,2,2,3,3,2,3,3,3), 
  date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
                 "2014-06-23","2014-06-24","2014-06-25","2014-06-26","2014-06-26")), 
  Status1 = c("A","B","A","A","B","A","A","A","B","B","B"))
DT_B

    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       A
 9:   3 2014-06-25       B
10:   3 2014-06-26       B
11:   3 2014-06-26       B

DT_B[!DT_B[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]

   sid       date Status1
1:   1 2014-06-22       A
2:   1 2014-06-23       B
3:   2 2014-06-22       A
4:   2 2014-06-23       A
5:   2 2014-06-24       B
6:   3 2014-06-25       B
7:   3 2014-06-26       B
8:   3 2014-06-26       B

Only the 3 consecutive rows containing the letter A (rows 6 to 8) are removed. The 3 consecutive rows containing B are kept.

Case C: Nothing to remove

DT_C <- data.table(
  sid=c(1,1,2,2,2,3,3,2,3,3,3), 
  date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
                 "2014-06-23","2014-06-24","2014-06-25","2014-06-26","2014-06-26")), 
  Status1 = c("A","B","A","A","B","A","A","C","B","B","C"))
DT_C

    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       C
 9:   3 2014-06-25       B
10:   3 2014-06-26       B
11:   3 2014-06-26       C

DT_C[!DT_C[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]

    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       C
 9:   3 2014-06-25       B
10:   3 2014-06-26       B
11:   3 2014-06-26       C

No rows are removed because there is no run of 3 consecutive rows containing A.

Case D: Edge case: remove all rows

DT_D <- DT_A[6:8]
DT_D

   sid       date Status1
1:   3 2014-06-22       A
2:   3 2014-06-23       A
3:   2 2014-06-24       A

DT_D[!DT_D[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]

 Empty data.table (0 rows) of 3 cols: sid,date,Status1

All rows are removed and an empty data.table is returned because the input data.table consists only of 3 rows with the letter A.
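The four cases above all repeat the same expression, so it could be wrapped in a small helper. The function name drop_runs and its arguments col, value, and n are hypothetical and not part of data.table; this is a minimal sketch that removes runs of exactly n rows, matching the .N == 3 condition used in this answer.

```r
library(data.table)

# Hypothetical helper: drop every run of exactly `n` consecutive rows
# in which column `col` equals `value`
drop_runs <- function(dt, col, value, n = 3) {
  run_id  <- rleid(dt[[col]])                               # one id per run
  run_len <- ave(seq_along(run_id), run_id, FUN = length)   # run length per row
  keep    <- !(run_len == n & dt[[col]] == value)           # TRUE = row survives
  dt[keep]
}

# Reduced versions of the test cases above (Status1 column only)
DT_A <- data.table(Status1 = c("A", "B", "A", "A", "B", "A", "A", "A", "B", "B"))
DT_D <- data.table(Status1 = c("A", "A", "A"))

nrow(drop_runs(DT_A, "Status1", "A"))   # 7 rows remain
nrow(drop_runs(DT_D, "Status1", "A"))   # 0 rows remain
```

Changing run_len == n to run_len >= n inside the helper would treat longer runs the same way, if that is the behaviour wanted.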

Answer 2

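# Row indices of every run of at least 3 consecutive "A" values
with(rle(DT_A$Status1 == "A"), {
    ends   <- cumsum(lengths)            # last row of each run
    starts <- ends - lengths + 1         # first row of each run
    runs   <- which(values & lengths >= 3)   # `values` keeps only runs of "A"
    unlist(Map(seq, starts[runs], ends[runs]))
})
#[1] 6 7 8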
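The snippet above only returns the row numbers. A minimal sketch of going from the run lengths to the filtered table, using base R's rle() and inverse.rle() on a plain data.frame that rebuilds the question's Status1 column (the names df, r, and keep are illustrative):

```r
status <- c("A", "B", "A", "A", "B", "A", "A", "A", "B", "B")
df <- data.frame(row = seq_along(status), Status1 = status)

r <- rle(df$Status1 == "A")
# Flag only runs of TRUE (i.e. "A") that are at least 3 long,
# then expand the flags back to one value per row
r$values <- r$values & r$lengths >= 3
keep <- !inverse.rle(r)

df[keep, ]
```

Unlike Answer 1, this version uses >= 3, so runs of four or more A rows are removed as well.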

Answer 3

I am assuming there is a mistake in your sid definition and that the 3 rows in question all have sid = 3. If not, sorry, my answer will not work. If that is the case, the solution can be a single line:

 DT_A[,.SD[.N < 3 | Status1 != "A",], by = .(sid,Status1)]

This simple line does what you want: grouping by sid and Status1, it keeps the rows of every group that has fewer than 3 rows or whose Status1 is not A (that is, the negation of the selection you want to delete: at least 3 A). Hope it helps.
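The effect of the sid assumption can be checked directly. The sketch below runs the same line on a corrected copy of the data (DT_fix, a hypothetical name, with row 8 set to sid = 3 as this answer assumes) and on the data exactly as posted. As far as I can tell, the grouping counts rows per sid and Status1 pair rather than consecutive rows, so the two inputs give different results.

```r
library(data.table)

dates <- as.Date(c("2014-06-22", "2014-06-23", "2014-06-22", "2014-06-23",
                   "2014-06-24", "2014-06-22", "2014-06-23", "2014-06-24",
                   "2014-06-25", "2014-06-26"))
status <- c("A", "B", "A", "A", "B", "A", "A", "A", "B", "B")

# ASSUMPTION (from this answer): row 8 should have sid = 3, not 2
DT_fix <- data.table(sid = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
                     date = dates, Status1 = status)
res_fix <- DT_fix[, .SD[.N < 3 | Status1 != "A"], by = .(sid, Status1)]

# Data exactly as posted in the question (row 8 has sid = 2)
DT_A <- data.table(sid = c(1, 1, 2, 2, 2, 3, 3, 2, 3, 3),
                   date = dates, Status1 = status)
res_posted <- DT_A[, .SD[.N < 3 | Status1 != "A"], by = .(sid, Status1)]

res_fix      # the three sid = 3, Status1 = "A" rows are gone
res_posted   # here the group sid = 2, Status1 = "A" (rows 3, 4, 8) is dropped instead
```

Note that the result has the grouping columns first (sid, Status1, date) and is ordered by group, not by the original row order.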


Answer 1: 2, accepted, 2017-11-07 16:22:03
Answer 2: 1, 2017-11-07 15:49:48
Answer 3: 1, 2017-11-07 17:39:12
