Looking for a cleaner way of filtering a data.table based on value differences within a multi-variable group

I was working on a problem where I had two grouping variables and one value. I only wanted to keep the rows where at least two of the values in the group are close to each other. In the example, I wanted groups that had at least one pair of values within 10 of each other.

Below is what I initially tried. Something about making a flag variable made me feel like I was doing it in a roundabout way, so I wanted to know if there's a cleaner, more intended way to do something like this in data.table. Thank you!

x and y are the categories, z is the value.

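library(data.table)
set.seed(123)

dt <- data.table(
  x = sample(LETTERS, 1000, TRUE),
  y = sample(letters, 1000, TRUE),
  z = sample(100, 1000, TRUE),
  key = tail(letters, 3)  # c("x", "y", "z"): key (and sort) by all three columns
)

dt <- unique(dt)
# Flag each (x, y) group that has adjacent sorted z values at most 11 apart,
# join the flag back onto the rows, and keep only the flagged groups
dt <- dt[dt[, .(flag = any(diff(z) <= 11)), .(x, y)], on = c("x", "y")][(flag)]
dt[, flag := NULL]
dt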

You can use .I with an if to determine whether to include each group (here want matches your final dt):

dt <- unique(dt)
want <- dt[dt[, if(any(diff(z) <= 11)) .I, .(x, y)]$V1]
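
To see what the inner call produces, here is a minimal sketch on a small made-up table. The name toy and its values are hypothetical, not the question's data. Because the j expression is unnamed, the grouped result gets a column called V1, which holds the row numbers of the groups that passed the test; indexing the table by those numbers keeps the matching rows.

```r
library(data.table)

# Hypothetical toy data, keyed so that z is sorted within each (x, y) group
toy <- data.table(
  x = c("A", "A", "A", "B", "B", "C"),
  y = c("a", "a", "a", "b", "b", "c"),
  z = c(1L, 50L, 55L, 10L, 90L, 7L),
  key = c("x", "y", "z")
)

# .I (row numbers) is returned only for groups where the if() condition holds;
# groups that fail return NULL and contribute nothing
idx <- toy[, if (any(diff(z) <= 11)) .I, by = .(x, y)]$V1

# Subset the table by those row numbers: only the (A, a) rows remain
kept <- toy[idx]
```

Only the (A, a) group has two adjacent values within 11 of each other (50 and 55), so idx contains rows 1 to 3.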

You could do

res <- dt[, if (.N > 1L && min(diff(z)) <= 11) .SD, by=.(x, y)]

I used min instead of any since I guess it leads to fewer computations.
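
For a group with at least two rows the two tests give the same answer, since some difference is at most 11 exactly when the smallest difference is. A quick check on a hypothetical vector (the values are made up for illustration); whether min is measurably faster would need benchmarking:

```r
z <- c(3L, 20L, 28L, 70L)  # hypothetical sorted values for one group
d <- diff(z)               # differences between adjacent values

res_any <- any(d <= 11)    # compares every difference, then reduces
res_min <- min(d) <= 11    # reduces first, then makes a single comparison
```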

I added the .N > 1L condition since you need to think about how to handle the single-row case (where diff returns a zero-length vector, so min would return Inf with a warning). You could do

  • .N > 1L && to drop those cases
  • .N == 1L || to keep them
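
A minimal sketch of the two guards, using a small made-up table (toy and its values are hypothetical). In R, diff of a single value returns a zero-length vector, and the guard short-circuits before min is ever called on it.

```r
library(data.table)

# Hypothetical toy data: (A, a) has close values, (B, b) does not,
# and (C, c) has a single row
toy <- data.table(
  x = c("A", "A", "B", "B", "C"),
  y = c("a", "a", "b", "b", "c"),
  z = c(1L, 5L, 10L, 90L, 7L),
  key = c("x", "y", "z")
)

n_diffs <- length(diff(7L))  # 0: a single value has no differences

# Drop single-row groups: only (A, a) survives
dropped <- toy[, if (.N > 1L && min(diff(z)) <= 11) .SD, by = .(x, y)]

# Keep single-row groups: (A, a) and (C, c) survive
kept <- toy[, if (.N == 1L || min(diff(z)) <= 11) .SD, by = .(x, y)]
```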

"I just wanted to know if there's a cleaner more intended way to do something like this in data.table"

I think a having= syntax would be convenient for this. It's currently a feature request.


Input data (since OP overwrites it):

library(data.table)
set.seed(123)
dt <- data.table(
  x = sample(LETTERS, 1000, TRUE),
  y = sample(letters, 1000, TRUE),
  z = sample(100, 1000, TRUE),
  key = tail(letters, 3)  # c("x", "y", "z"): key (and sort) by all three columns
)
dt <- unique(dt)
