Remove rows with leading missing values in a specific column by group with data.table

I have a data.table like this:

library(data.table)
library(zoo)  # for na.trim, used below
DT <- data.table(id = c(rep("a", 3), rep("b", 3)),
                 col1 = c(NA,1,2,NA,3,NA), col2 = c(NA,NA,5,NA,NA,NA))
   id col1 col2
1:  a   NA   NA
2:  a    1   NA
3:  a    2    5
4:  b   NA   NA
5:  b    3   NA
6:  b   NA   NA

For each id, I would like to remove rows with leading NAs in 'col1' using zoo::na.trim. Here's the result I'm expecting:

   id col1 col2
1:  a    1   NA
2:  a    2    5
3:  b    3   NA
4:  b   NA   NA

Here's what I have tried so far. This does remove the leading NAs in 'col1', but it omits 'col2' from the result:

DT[ , na.trim(col1), by = id]
   id V1
1:  a  1
2:  a  2
3:  b  3

This does not work either:

DT[ , .SD[na.trim(col1)], by = id]
   id col1 col2
1:  a   NA   NA
2:  a    1   NA
3:  b   NA   NA

A possible solution without using the zoo package:

DT[DT[, .I[!!cumsum(!is.na(col1))], by = id]$V1]

you get:

   id col1 col2
1:  a    1   NA
2:  a    2    5
3:  b    3   NA
4:  b   NA   NA

What this does:

  • With DT[, .I[!!cumsum(!is.na(col1))], by = id]$V1 you create a vector of row numbers to keep. By using !!cumsum(!is.na(col1)) you make sure that only the leading missing values of col1 are omitted.
  • Next you use that vector to subset the data.table.
  • !!cumsum(!is.na(col1)) does the same as cumsum(!is.na(col1)) != 0. Using !! converts all numbers higher than zero to TRUE and all zeros to FALSE.
  • .I isn't strictly needed; you can also use DT[DT[, !!cumsum(!is.na(col1)), by = id]$V1], which subsets the data.table with a logical vector. Because grouped results are returned group by group, this variant relies on the rows of each id being contiguous in DT, as they are here.
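To see why the cumsum trick only drops leading NAs, it can help to look at the intermediate vectors for a single group. The following is a minimal base R sketch using the col1 values of id "b" from the question; the variable names x, seen and keep are made up for illustration.

```r
# col1 values for id "b" in the question's data
x <- c(NA, 3, NA)

# running count of non-missing values seen so far
seen <- cumsum(!is.na(x))   # 0 1 1

# zero becomes FALSE, anything positive becomes TRUE (same as seen != 0)
keep <- !!seen              # FALSE TRUE TRUE

# the leading NA is dropped, the trailing NA is kept
x[keep]                     # 3 NA
```

Once the first non-missing value has been seen, the running count never returns to zero, so every later row is kept regardless of whether it is NA.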

Two alternatives with cummax by @lmo from the comments:

# alternative 1:
DT[DT[, !!(cummax(!is.na(col1))), by = id]$V1]

# alternative 2:
DT[as.logical(DT[, cummax(!is.na(col1)), by = id]$V1)]

Another alternative by @jogo:

DT[, .SD[!!cumsum(!is.na(col1))], by = id]

Another alternative by @Frank:

DT[, .SD[ rleid(col1) > 1L | !is.na(col1) ], by = id]
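As a quick sanity check, the variants above can be run side by side on the question's data and compared. This is only a sketch: the res_* names are hypothetical, and agreement on this small example does not show that the variants behave identically on every input.

```r
library(data.table)

# the example data from the question
DT <- data.table(id = c(rep("a", 3), rep("b", 3)),
                 col1 = c(NA, 1, 2, NA, 3, NA),
                 col2 = c(NA, NA, 5, NA, NA, NA))

# row-number subset via cumsum
res_cumsum <- DT[DT[, .I[!!cumsum(!is.na(col1))], by = id]$V1]
# logical subset via cummax
res_cummax <- DT[as.logical(DT[, cummax(!is.na(col1)), by = id]$V1)]
# per-group subset of .SD via cumsum
res_sd     <- DT[, .SD[!!cumsum(!is.na(col1))], by = id]
# per-group subset of .SD via rleid
res_rleid  <- DT[, .SD[rleid(col1) > 1L | !is.na(col1)], by = id]

# compare each alternative against the first one
all.equal(res_cumsum, res_cummax)
all.equal(res_cumsum, res_sd)
all.equal(res_cumsum, res_rleid)
```

The two .SD variants are arguably the easiest to read, while the .I variant avoids building .SD for every group, which tends to matter more as the number of groups grows.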

na.trim would be used like this with data.table. See ?na.trim for more info on its arguments.

DT[, na.trim(.SD, sides = "left", is.na = "all"), by = id]

giving:

   id col1 col2
1:  a    1   NA
2:  a    2    5
3:  b    3   NA
4:  b   NA   NA

ADDED:

In a comment, the poster clarified that only the NAs in col1 should be operated on by na.trim. In that case, append a column of row numbers, .I, and after invoking na.trim, subset using those row numbers.

DT[DT[, na.trim(data.table(col1, .I), "left"), by = id]$.I, ]

We can use 1:.N >= which.max(...) to subset the required rows:

DT[, .SD[1:.N >= which.max(!is.na(col1))], id]
   id col1 col2
1:  a    1   NA
2:  a    2    5
3:  b    3   NA
4:  b   NA   NA
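One difference from the cumsum-based answers is worth noting: when a group has no non-missing value at all, which.max() on an all-FALSE vector returns 1, so the whole group is kept rather than dropped. A minimal base R sketch with a made-up all-NA group (seq_along(x) stands in for 1:.N here):

```r
# hypothetical group in which col1 is entirely NA
x <- c(NA, NA, NA)

# which.max() returns the first position when every element is FALSE
first_pos <- which.max(!is.na(x))

# which.max approach: every row is kept
keep_whichmax <- seq_along(x) >= first_pos

# cumsum approach: no row is kept
keep_cumsum <- !!cumsum(!is.na(x))
```

Which behavior is preferable depends on whether an all-NA group should count as consisting entirely of leading NAs.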
