Looking for a cleaner way of filtering a data.table based on value differences within a multi-variable group
I was working on a problem where I had two grouping variables and one value. I only want to keep the rows of groups where at least two of the values are close to each other. In the example I wanted groups that had at least one pair of values within 10 of each other.
Below is what I initially tried. Something about making a flag variable made me feel like I was doing it in a roundabout way, and I just wanted to know if there's a cleaner, more intended way to do something like this in data.table. Thank you!
x and y are the categories, z is the value.
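library(data.table)
set.seed(123)
dt <- data.table(
  x = sample(LETTERS, 1000, TRUE),
  y = sample(letters, 1000, TRUE),
  z = sample(100, 1000, TRUE),
  key = tail(letters, 3)  # key = c("x", "y", "z"), so rows are sorted by x, y, z
)
dt <- unique(dt)
# flag each (x, y) group, join the flag back onto dt, keep only flagged groups
dt <- dt[dt[, .(flag = any(diff(z) <= 11)), .(x, y)], on = c("x", "y")][(flag)]
dt[, flag := NULL]
dt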
You can use .I with an if to determine whether to include each group (here want matches your final dt):
dt <- unique(dt)
want <- dt[dt[, if(any(diff(z) <= 11)) .I, .(x, y)]$V1]
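To make the .I idiom easier to follow, here is a minimal sketch on a small table. The table toy and its values are hypothetical, chosen only so that one group passes the test and one does not; the threshold of 11 is taken from the code above.

```r
library(data.table)

# Hypothetical toy data: group (a, p) has a close pair, group (b, q) does not.
toy <- data.table(
  x = c("a", "a", "a", "b", "b"),
  y = c("p", "p", "p", "q", "q"),
  z = c(1L, 5L, 50L, 10L, 90L)
)
setkey(toy, x, y, z)  # sort so that diff(z) compares neighbouring values

# Inner call: for each group that passes the test, return its row numbers (.I).
# Groups that fail return NULL and contribute nothing. The result column is V1.
idx <- toy[, if (any(diff(z) <= 11)) .I, by = .(x, y)]$V1

# Outer call: subset the original table by those row numbers.
kept <- toy[idx]
print(kept)
```

Only the three rows of group (a, p) should remain, because 1 and 5 differ by 4, while the single difference in group (b, q) is 80.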
You could do
res <- dt[, if (.N > 1L && min(diff(z)) <= 11) .SD, by=.(x, y)]
I used min instead of any since I guess it leads to fewer computations.
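Single-row groups are an edge case for both conditions. A minimal base R sketch of what diff, any and min return when a group holds just one value (the value 42 is arbitrary, and z_single is a name made up for this illustration):

```r
z_single <- 42L                 # z for a group with a single row

d <- diff(z_single)             # no neighbouring value to compare with
length(d)                       # 0: the result is empty

any(d <= 11)                    # FALSE: any() of an empty logical vector

# min() of an empty vector returns Inf and emits a warning,
# which is silenced here only to keep the sketch quiet.
m <- suppressWarnings(min(d))
m <= 11                         # FALSE, since m is Inf
```

So both tests come out FALSE for such a group, but the min version also raises a warning unless the group size is checked first.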
I added the .N > 1L condition since you need to think about how to handle the single-row case (where diff(z) is empty, so min() returns Inf with a warning). You could use
.N > 1L && to drop those cases
.N == 1L || to keep them
"I just wanted to know if there's a cleaner more intended way to do something like this in data.table"
I think a having= syntax would be convenient for this. It's currently a feature request.
Input data (since OP overwrites it):
library(data.table)
set.seed(123)
dt <- data.table(
  x = sample(LETTERS, 1000, TRUE),
  y = sample(letters, 1000, TRUE),
  z = sample(100, 1000, TRUE),
  key = tail(letters, 3)
)
dt <- unique(dt)
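The two ways of handling single-row groups mentioned in this answer can be compared on a small table. This is only a sketch: the table toy and its values are hypothetical, picked so that there is one passing group, one failing group and one single-row group.

```r
library(data.table)

# Hypothetical toy data: (a, p) has a close pair, (b, q) does not,
# and (c, r) is a single-row group.
toy <- data.table(
  x = c("a", "a", "b", "b", "c"),
  y = c("p", "p", "q", "q", "r"),
  z = c(1L, 5L, 10L, 90L, 7L),
  key = c("x", "y", "z")
)

# Drop single-row groups: && short-circuits, so min() never sees an empty diff.
dropped <- toy[, if (.N > 1L && min(diff(z)) <= 11) .SD, by = .(x, y)]

# Keep single-row groups: || short-circuits for the same reason.
kept <- toy[, if (.N == 1L || min(diff(z)) <= 11) .SD, by = .(x, y)]

print(dropped)
print(kept)
```

With these values, dropped should contain only the two rows of group (a, p), while kept should also contain the single row of group (c, r).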