
Remove rows conditionally from a data.table in R

I have a data.table with fields {id, menuitem, amount}.

This is transaction data, so ids are unique but menuitem repeats. Now, I want to remove all entries where menuitem == 'coffee'.

I also want to delete all rows where amount <= 0.

What is the right way to do this in data.table?

I can use data$menuitem != 'coffee' and then index into data[], but that is not necessarily efficient and does not take advantage of data.table.

Any pointers in the right direction are appreciated.

In this scenario it is not so different from data.frame:

data <- data[menuitem != 'coffee' & amount > 0]

Deleting or adding rows by reference is yet to be implemented. You can find more info in this question.
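To make the filter concrete, here is a minimal sketch on made-up data. The sample values and the name `kept` are assumptions for illustration; only the column names come from the question. Because rows are not removed by reference, the subset is a new table and has to be assigned back.

```r
library(data.table)

# Hypothetical sample data using the fields named in the question
data <- data.table(
  id       = 1:5,
  menuitem = c("coffee", "tea", "bagel", "coffee", "tea"),
  amount   = c(3.5, 2.0, 0, -1.0, 4.0)
)

# Keep only rows that are not coffee AND have a positive amount.
# Subsetting returns a new data.table; `data` itself is unchanged here.
kept <- data[menuitem != "coffee" & amount > 0]

nrow(data)  # still 5
kept$id     # ids 2 and 5 survive

# Assign the result back to drop the rows from `data`
data <- kept
```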

Regarding speed:

1. You can benefit from keys by doing something like:

setkey(data, menuitem)
data <- data[!"coffee"]

This will be faster than data <- data[menuitem != 'coffee']. However, to apply the same filters you asked about in the question you'll need a rolling join (I've finished my lunch break; I can add something later :-)).
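As a simpler alternative to a rolling join, the keyed not-join can be chained with an ordinary subset on amount. This is only a sketch on made-up data, not necessarily what the answer's author had in mind, and it makes no claim about relative speed.

```r
library(data.table)

# Hypothetical sample data using the fields named in the question
data <- data.table(
  id       = 1:5,
  menuitem = c("coffee", "tea", "bagel", "coffee", "tea"),
  amount   = c(3.5, 2.0, 0, -1.0, 4.0)
)

# setkey() sorts `data` by menuitem, by reference
setkey(data, menuitem)

# Not-join on the key drops the coffee rows;
# the chained subset then drops non-positive amounts
data <- data[!"coffee"][amount > 0]

data$id  # ids 2 and 5 remain
```

Note that setkey() reorders the rows, so the result comes back sorted by menuitem rather than in the original order.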

2. Even without a key, data.table is much faster for relatively big tables (and similar in speed for a handful of rows):

library(data.table)
library(microbenchmark)
dt <- data.table(id = sample(letters, 1000000, TRUE), var = rnorm(1000000))
df <- data.frame(id = sample(letters, 1000000, TRUE), var = rnorm(1000000))
microbenchmark(dt[id == "a"], df[df$id == "a", ])
Unit: milliseconds
               expr       min        lq    median        uq       max neval
      dt[id == "a"]  24.42193  25.74296  26.00996  26.35778  27.36355   100
 df[df$id == "a", ] 138.17500 146.46729 147.38646 149.06766 154.10051   100

Try this:

data <- data[ !(menuitem == 'coffee' | amount <= 0),] 
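The negated form, "not (coffee or non-positive amount)", selects the same rows as "not coffee and positive amount". The sketch below checks that on made-up data; the sample values and the names `negated` and `direct` are assumptions for illustration.

```r
library(data.table)

# Hypothetical sample data using the fields named in the question
data <- data.table(
  id       = 1:5,
  menuitem = c("coffee", "tea", "bagel", "coffee", "tea"),
  amount   = c(3.5, 2.0, 0, -1.0, 4.0)
)

# Negate the "rows to remove" condition ...
negated <- data[!(menuitem == "coffee" | amount <= 0), ]

# ... or state the "rows to keep" condition directly
direct <- data[menuitem != "coffee" & amount > 0]

# Both forms select the same rows
identical(negated$id, direct$id)
```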

Generally:

dt <- data.table(a=c(1,1,1,2,2,2,3,3,3),b=c(4,2,3,1,5,3,4,7,6))
dt
#>    a b
#> 1: 1 4
#> 2: 1 2
#> 3: 1 3
#> 4: 2 1
#> 5: 2 5
#> 6: 2 3
#> 7: 3 4
#> 8: 3 7
#> 9: 3 6
dt[a!=1,]
#>    a b
#> 1: 2 1
#> 2: 2 5
#> 3: 2 3
#> 4: 3 4
#> 5: 3 7
#> 6: 3 6
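The same pattern extends to removing several values at once. As a sketch on the same table, %in% can be negated to drop every row whose a matches any value in a set; the name `res` is only for illustration.

```r
library(data.table)

dt <- data.table(a = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 b = c(4, 2, 3, 1, 5, 3, 4, 7, 6))

# Drop all rows where a is 1 or 3; only the a == 2 rows remain
res <- dt[!(a %in% c(1, 3))]
res
```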
