How to delete 3 consecutive rows that contain the same value in a data.table
I have a data.table in R with 3 columns, as follows:
library(data.table)
DT_A <- data.table(
  sid = c(1,1,2,2,2,3,3,2,3,3),
  date = as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23","2014-06-24","2014-06-22",
                   "2014-06-23","2014-06-24","2014-06-25","2014-06-26")),
  Status1 = c("A","B","A","A","B","A","A","A","B","B"))
The data looks like this:
sid date Status1
1: 1 2014-06-22 A
2: 1 2014-06-23 B
3: 2 2014-06-22 A
4: 2 2014-06-23 A
5: 2 2014-06-24 B
6: 3 2014-06-22 A
7: 3 2014-06-23 A
8: 2 2014-06-24 A
9: 3 2014-06-25 B
10: 3 2014-06-26 B
How can I check Status1 and, if there are 3 consecutive rows with value A (like rows 6, 7, 8), delete those rows?
The question is tagged data.table, so I'll try to give an appropriate answer:
DT_A[!DT_A[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
sid date Status1
1: 1 2014-06-22 A
2: 1 2014-06-23 B
3: 2 2014-06-22 A
4: 2 2014-06-23 A
5: 2 2014-06-24 B
6: 3 2014-06-25 B
7: 3 2014-06-26 B
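To see why the one-liner picks rows 6 to 8, it can help to look at the intermediate results. The following is only a sketch: the names `runs`, `run`, `status`, `n` and `drop_idx` are made up for illustration and are not part of the answer. `rleid()` numbers each run of identical consecutive values, `.N` is the run length and `.I` holds the original row numbers.

```r
library(data.table)

DT_A <- data.table(
  sid = c(1,1,2,2,2,3,3,2,3,3),
  date = as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23","2014-06-24",
                   "2014-06-22","2014-06-23","2014-06-24","2014-06-25","2014-06-26")),
  Status1 = c("A","B","A","A","B","A","A","A","B","B"))

# One row per run of identical consecutive Status1 values
runs <- DT_A[, .(status = Status1[1], n = .N), by = .(run = rleid(Status1))]
runs$n       # 1 1 2 1 3 2
runs$status  # "A" "B" "A" "B" "A" "B"

# Row numbers (.I) of the runs that consist of exactly 3 rows of "A"
drop_idx <- DT_A[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1
drop_idx     # 6 7 8
```

Only the fifth run has length 3 and value A, so only its row numbers end up in `drop_idx`.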
As pointed out by Frank, my first answer (now edited) worked only for the sample data set provided by the OP and failed for other test cases.
So, the edited code is applied to some other test cases below.
Case B: 3 consecutive rows of letters A and B
DT_B <- data.table(
sid=c(1,1,2,2,2,3,3,2,3,3,3),
date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
"2014-06-23","2014-06-24","2014-06-25","2014-06-26","2014-06-26")),
Status1 = c("A","B","A","A","B","A","A","A","B","B","B"))
DT_B
sid date Status1
1: 1 2014-06-22 A
2: 1 2014-06-23 B
3: 2 2014-06-22 A
4: 2 2014-06-23 A
5: 2 2014-06-24 B
6: 3 2014-06-22 A
7: 3 2014-06-23 A
8: 2 2014-06-24 A
9: 3 2014-06-25 B
10: 3 2014-06-26 B
11: 3 2014-06-26 B
DT_B[!DT_B[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
sid date Status1
1: 1 2014-06-22 A
2: 1 2014-06-23 B
3: 2 2014-06-22 A
4: 2 2014-06-23 A
5: 2 2014-06-24 B
6: 3 2014-06-25 B
7: 3 2014-06-26 B
8: 3 2014-06-26 B
Only the 3 consecutive rows containing letter A (rows 6 to 8) are removed. The 3 consecutive rows containing letter B are kept.
Case C: Nothing to remove
DT_C <- data.table(
sid=c(1,1,2,2,2,3,3,2,3,3,3),
date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
"2014-06-23","2014-06-24","2014-06-25","2014-06-26","2014-06-26")),
Status1 = c("A","B","A","A","B","A","A","C","B","B","C"))
DT_C
sid date Status1
1: 1 2014-06-22 A
2: 1 2014-06-23 B
3: 2 2014-06-22 A
4: 2 2014-06-23 A
5: 2 2014-06-24 B
6: 3 2014-06-22 A
7: 3 2014-06-23 A
8: 2 2014-06-24 C
9: 3 2014-06-25 B
10: 3 2014-06-26 B
11: 3 2014-06-26 C
DT_C[!DT_C[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
sid date Status1
1: 1 2014-06-22 A
2: 1 2014-06-23 B
3: 2 2014-06-22 A
4: 2 2014-06-23 A
5: 2 2014-06-24 B
6: 3 2014-06-22 A
7: 3 2014-06-23 A
8: 2 2014-06-24 C
9: 3 2014-06-25 B
10: 3 2014-06-26 B
11: 3 2014-06-26 C
No row is removed as there are no 3 consecutive rows containing A.
Case D: Edge case: remove all rows
DT_D <- DT_A[6:8]
DT_D
sid date Status1
1: 3 2014-06-22 A
2: 3 2014-06-23 A
3: 2 2014-06-24 A
DT_D[!DT_D[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
Empty data.table (0 rows) of 3 cols: sid,date,Status1
All rows are removed and an empty data.table is returned because the input data.table consists only of 3 rows with letter A.
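Since the same expression is repeated for every test case, it could be wrapped in a small helper. This is just a sketch: `drop_runs()` and its arguments `col`, `value` and `n` are hypothetical names, not something provided by data.table. It builds a logical mask from `rleid()` instead of using `.I`, and should give the same rows for the cases above.

```r
library(data.table)

# Hypothetical helper: drop every run of exactly `n` consecutive rows
# whose value in column `col` equals `value`
drop_runs <- function(dt, col = "Status1", value = "A", n = 3L) {
  run_id  <- rleid(dt[[col]])            # run number for each row
  run_len <- tabulate(run_id)[run_id]    # length of the run each row belongs to
  keep    <- !(run_len == n & dt[[col]] %in% value)
  dt[keep]
}

# Status1 vectors taken from Case A (original data), Case C and Case D
case_a <- data.table(Status1 = c("A","B","A","A","B","A","A","A","B","B"))
case_c <- data.table(Status1 = c("A","B","A","A","B","A","A","C","B","B","C"))
case_d <- data.table(Status1 = c("A","A","A"))

nrow(drop_runs(case_a))  # 7
nrow(drop_runs(case_c))  # 11
nrow(drop_runs(case_d))  # 0
```

Using `%in%` rather than `==` keeps rows with a missing status from being turned into `NA` in the mask.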
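with(rle(DT_A$Status1 == "A"), {
  ends <- cumsum(lengths)              # last row index of each run
  starts <- ends - lengths + 1         # first row index of each run
  hit <- which(values & lengths >= 3)  # runs of "A" with at least 3 rows
  unlist(lapply(hit, function(i) starts[i]:ends[i]))
})
#[1] 6 7 8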
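The snippet above only returns the row indices. One way to actually drop those rows, shown here as a sketch with illustrative names (`r`, `in_run`, `res`), is to expand the run information back to one logical value per row and subset with it. Like the code above, it treats runs of at least 3 rows of A as matches.

```r
library(data.table)

DT_A <- data.table(
  sid = c(1,1,2,2,2,3,3,2,3,3),
  date = as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23","2014-06-24",
                   "2014-06-22","2014-06-23","2014-06-24","2014-06-25","2014-06-26")),
  Status1 = c("A","B","A","A","B","A","A","A","B","B"))

r <- rle(DT_A$Status1 == "A")

# TRUE for every row that sits in a run of "A" with at least 3 rows
in_run <- rep(r$values & r$lengths >= 3, times = r$lengths)
which(in_run)   # 6 7 8

# Keep everything else
res <- DT_A[which(!in_run)]
nrow(res)       # 7
```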
I am assuming that there is a mistake in your sid definition and that your 3 rows all have sid = 3. If not, sorry, my answer will not work. If that is the case, the solution can be a single line:
DT_A[,.SD[.N < 3 | Status1 != "A",], by = .(sid,Status1)]
This is a simple line that does what you want: when grouping by sid and Status1, it keeps the groups that have fewer than 3 rows or whose Status1 is different from A (that is, the negation of the selection you want to delete: at least 3 A). Note that it counts rows per sid and Status1 group, not consecutive runs. Hope it helps.
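To illustrate how much this depends on the sid assumption, here is a small sketch that runs the same line on the data as posted and on a copy in which row 8 is assumed to belong to sid 3. The names `DT_fix`, `res_raw` and `res_fix` are made up for the example.

```r
library(data.table)

DT_A <- data.table(
  sid = c(1,1,2,2,2,3,3,2,3,3),
  date = as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23","2014-06-24",
                   "2014-06-22","2014-06-23","2014-06-24","2014-06-25","2014-06-26")),
  Status1 = c("A","B","A","A","B","A","A","A","B","B"))

# Hypothetical corrected copy: row 8 is assumed to have sid 3 instead of 2
DT_fix <- copy(DT_A)
DT_fix[8, sid := 3]

res_fix <- DT_fix[, .SD[.N < 3 | Status1 != "A"], by = .(sid, Status1)]
res_raw <- DT_A[, .SD[.N < 3 | Status1 != "A"], by = .(sid, Status1)]

# Corrected copy: the 3 A rows of sid 3 (rows 6 to 8) are gone
res_fix[sid == 3 & Status1 == "A", .N]  # 0

# Data as posted: sid 2 has 3 A rows (rows 3, 4, 8), so those are removed instead
res_raw[sid == 2 & Status1 == "A", .N]  # 0
res_raw[sid == 3 & Status1 == "A", .N]  # 2
```

The grouping columns come first in the result, so the column order becomes sid, Status1, date.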