
How to delete 3 consecutive rows that contain the same value in a data.table

I have a data.table in R with 3 columns, as follows:

library(data.table)
DT_A <- data.table(
  sid = c(1,1,2,2,2,3,3,2,3,3),
  date = as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23","2014-06-24",
                   "2014-06-22","2014-06-23","2014-06-24","2014-06-25","2014-06-26")),
  Status1 = c("A","B","A","A","B","A","A","A","B","B"))

The data looks like this:

    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       A
 9:   3 2014-06-25       B
10:   3 2014-06-26       B

How can I check Status1 and, if there are 3 consecutive rows with the value A (like rows 6, 7 and 8), delete those rows?

The question is tagged data.table, so I'll try to give an appropriate answer:

DT_A[!DT_A[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
   sid       date Status1
1:   1 2014-06-22       A
2:   1 2014-06-23       B
3:   2 2014-06-22       A
4:   2 2014-06-23       A
5:   2 2014-06-24       B
6:   3 2014-06-25       B
7:   3 2014-06-26       B
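
To see why the one-liner works, it may help to unpack it into steps. The following is only an illustrative sketch: the intermediate names `run_id`, `drop_idx` and `result` are not part of the original answer, and the `date` column is omitted because it plays no role in the logic.

```r
library(data.table)

DT_A <- data.table(
  sid = c(1, 1, 2, 2, 2, 3, 3, 2, 3, 3),
  Status1 = c("A", "B", "A", "A", "B", "A", "A", "A", "B", "B"))

# Step 1: rleid() gives every run of identical consecutive values its own id
run_id <- rleid(DT_A$Status1)
run_id
# 1 2 3 3 4 5 5 5 6 6

# Step 2: per run, return the row numbers (.I) only if the run has
# exactly 3 rows (.N == 3) and those rows contain "A"
drop_idx <- DT_A[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1
drop_idx
# 6 7 8

# Step 3: negating the row numbers in i keeps all the other rows
result <- DT_A[!drop_idx]
```

Note that `.N == 3` matches runs of exactly 3 rows; a run of 4 or more A would not be removed by this condition.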

Other test cases

As pointed out by Frank, my first answer (now edited) worked only for the sample data set provided by the OP and failed for other test cases.

So, the edited code is applied to some other test cases below.

Case B: 3 consecutive rows of letter A as well as of letter B

DT_B <- data.table(
  sid=c(1,1,2,2,2,3,3,2,3,3,3), 
  date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
                 "2014-06-23","2014-06-24","2014-06-25","2014-06-26","2014-06-26")), 
  Status1 = c("A","B","A","A","B","A","A","A","B","B","B"))
DT_B
    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       A
 9:   3 2014-06-25       B
10:   3 2014-06-26       B
11:   3 2014-06-26       B
DT_B[!DT_B[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
   sid       date Status1
1:   1 2014-06-22       A
2:   1 2014-06-23       B
3:   2 2014-06-22       A
4:   2 2014-06-23       A
5:   2 2014-06-24       B
6:   3 2014-06-25       B
7:   3 2014-06-26       B
8:   3 2014-06-26       B

Only the 3 consecutive rows containing letter A (rows 6 to 8) are removed; the 3 consecutive rows containing letter B (rows 9 to 11) are kept.

Case C: Nothing to remove

DT_C <- data.table(
  sid=c(1,1,2,2,2,3,3,2,3,3,3), 
  date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
                 "2014-06-23","2014-06-24","2014-06-25","2014-06-26","2014-06-26")), 
  Status1 = c("A","B","A","A","B","A","A","C","B","B","C"))
DT_C
    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       C
 9:   3 2014-06-25       B
10:   3 2014-06-26       B
11:   3 2014-06-26       C
DT_C[!DT_C[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       C
 9:   3 2014-06-25       B
10:   3 2014-06-26       B
11:   3 2014-06-26       C

No row is removed as there are no 3 consecutive rows containing A.

Case D: Edge case: remove all rows

DT_D <- DT_A[6:8]
DT_D
   sid       date Status1
1:   3 2014-06-22       A
2:   3 2014-06-23       A
3:   2 2014-06-24       A
DT_D[!DT_D[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
 Empty data.table (0 rows) of 3 cols: sid,date,Status1 

All rows are removed and an empty data.table is returned because the input data.table consists only of 3 rows with letter A.
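
The test cases above all use the same expression, so it could be wrapped in a small function. The sketch below is an assumption on my part rather than anything from the thread: `drop_runs` and its arguments `col`, `value`, `n` and `exact` are hypothetical names, and the `exact` switch is added only to make the difference between "exactly 3" and "at least 3" explicit.

```r
library(data.table)

# Hypothetical helper (not part of data.table): drop every run of `value`
# in column `col` that is exactly `n` rows long (or at least `n` rows long
# when exact = FALSE)
drop_runs <- function(dt, col, value, n = 3L, exact = TRUE) {
  x <- dt[[col]]
  run <- rleid(x)                               # run id per row
  len <- ave(seq_along(x), run, FUN = length)   # run length per row
  long_enough <- if (exact) len == n else len >= n
  drop_idx <- which(x == value & long_enough)   # row numbers to remove
  # Select the complement explicitly so an empty drop_idx keeps all rows
  dt[setdiff(seq_len(nrow(dt)), drop_idx)]
}

DT_A <- data.table(
  sid = c(1, 1, 2, 2, 2, 3, 3, 2, 3, 3),
  Status1 = c("A", "B", "A", "A", "B", "A", "A", "A", "B", "B"))
res_A <- drop_runs(DT_A, "Status1", "A")        # rows 6 to 8 are removed

# A run of 4 "A": untouched with exact = TRUE, removed with exact = FALSE
DT_4 <- data.table(Status1 = c("A", "A", "A", "A", "B"))
res_exact <- drop_runs(DT_4, "Status1", "A")
res_atleast <- drop_runs(DT_4, "Status1", "A", exact = FALSE)
```

Which of the two behaviours is wanted depends on how the question is read; the answer above implements the "exactly 3" reading.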

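# Row numbers that belong to runs of at least 3 consecutive "A" values
with(rle(DT_A$Status1 == "A"), {
    ends   <- cumsum(lengths)               # last row of each run
    starts <- ends - lengths + 1            # first row of each run
    runs   <- which(values & lengths >= 3)  # only runs of "A", not of other values
    unlist(Map(seq, starts[runs], ends[runs]))
})
#[1] 6 7 8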
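
The `rle()` approach above yields row numbers only. One way to go from the run-length encoding to the filtered table is to expand it back into a per-row logical mask. This is a sketch, not the answerer's code: the names `r`, `in_long_run` and `keep` are illustrative, and the `date` column is left out for brevity.

```r
library(data.table)

DT_A <- data.table(
  sid = c(1, 1, 2, 2, 2, 3, 3, 2, 3, 3),
  Status1 = c("A", "B", "A", "A", "B", "A", "A", "A", "B", "B"))

r <- rle(DT_A$Status1 == "A")

# TRUE for every row that sits in a run of at least 3 consecutive "A";
# r$values restricts the match to runs of "A" rather than runs of non-"A"
in_long_run <- rep(r$values & r$lengths >= 3, times = r$lengths)
which(in_long_run)
# 6 7 8

# Keep the rows outside those runs
keep <- !in_long_run
res <- DT_A[keep]
```

Working with a logical mask sidesteps the special case of an empty index vector when nothing needs to be removed.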

I am assuming that there is a mistake in your sid definition and that the 3 rows in question all have sid = 3. If not, sorry, my answer will not work. If that is the case, the solution can be a single line:

 DT_A[,.SD[.N < 3 | Status1 != "A",], by = .(sid,Status1)]

This simple line does what you want: grouping by sid and Status1, it keeps the rows of every group that has fewer than 3 rows or whose Status1 is different from A (that is, the negation of the selection you want to delete: at least 3 A). Note that it counts rows per group and does not check that they are consecutive. Hope it helps.
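
To check whether the assumption about sid holds for a given data set, the group sizes can be inspected before applying the one-liner. A minimal sketch, with the illustrative names `counts`, `n` and `rows` and the `date` column omitted:

```r
library(data.table)

DT_A <- data.table(
  sid = c(1, 1, 2, 2, 2, 3, 3, 2, 3, 3),
  Status1 = c("A", "B", "A", "A", "B", "A", "A", "A", "B", "B"))

# Size of each (sid, Status1) group and the original row numbers it covers
counts <- DT_A[, .(n = .N, rows = paste(.I, collapse = ",")),
               by = .(sid, Status1)]
counts

# Groups that the condition .N >= 3 & Status1 == "A" would select
counts[n >= 3 & Status1 == "A"]
```

With the data exactly as posted in the question, the only group of 3 A is sid = 2 (rows 3, 4 and 8), not rows 6 to 8, which is why this answer depends on row 8 actually having sid = 3.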

