Looking for a cleaner way to filter a data.table based on value differences within groups defined by multiple variables

Question

I was working on a problem where I had two grouping variables and one value. I only want to keep the rows where at least two of the values in the group are close to each other. In the example, I wanted groups that had at least one pair of values within 10 of each other.

Below is what I initially tried. Something about making a flag variable made me feel like I was doing it in a roundabout way, and I wanted to know if there's a cleaner, more intended way to do something like this in data.table. Thank you!

x and y are the categories, z is the value.

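library(data.table)
set.seed(123)

# key = c("x", "y", "z") sorts the table by x, y, z,
# so diff(z) below runs over sorted values within each group
dt <- data.table(
  x = sample(LETTERS, 1000, TRUE),
  y = sample(letters, 1000, TRUE),
  z = sample(100, 1000, TRUE),
  key = tail(letters, 3)
)

dt <- unique(dt)
# compute one flag per (x, y) group, join it back, keep flagged rows
dt <- dt[dt[, .(flag = any(diff(z) <= 11)), .(x, y)], on = c("x", "y")][(flag)]
dt[, flag := NULL]
dt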

Answer 1

You can use .I with an if to determine whether to include each group (here want matches your final dt):

dt <- unique(dt)
want <- dt[dt[, if(any(diff(z) <= 11)) .I, .(x, y)]$V1]
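To see what the inner call returns, here is a minimal sketch on a small made-up table (toy and its values are illustrative assumptions, not the data from the question). Inside j, .I holds the row numbers of the current group within the whole table; when the if condition is false, the group returns NULL and contributes no rows.

```r
library(data.table)

# Hypothetical toy table: group (A, a) has two close values,
# (A, b) does not, and (B, a) has a single row.
toy <- data.table(
  x = c("A", "A", "A", "A", "B"),
  y = c("a", "a", "b", "b", "a"),
  z = c(10L, 15L, 10L, 90L, 50L)
)

# One result row per kept row of toy; the unnamed j result lands in column V1.
idx <- toy[, if (any(diff(z) <= 11)) .I, by = .(x, y)]
idx$V1

# Subset the original table by those row numbers.
toy[idx$V1]
```

Only the two rows of group (A, a) survive here: (A, b) fails the condition, and the single-row group (B, a) yields an empty diff(), for which any() is FALSE.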

Answer 2

You could do:

res <- dt[, if (.N > 1L && min(diff(z)) <= 11) .SD, by=.(x, y)]

I used min instead of any since I guess it leads to fewer computations.

I added the .N > 1L condition because you need to decide how to handle single-row groups, where diff(z) is empty (zero-length rather than NA) and min() would warn and return Inf. You could use:

.N > 1L && to drop those cases
.N == 1L || to keep them
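The two options can be compared side by side in a small sketch, again on a made-up table (toy, dropped and kept are hypothetical names used only for illustration). Both && and || short-circuit, so min() is never called on the empty diff() of a single-row group.

```r
library(data.table)

# Hypothetical toy table: (A, a) has two close values, (B, a) is a single row.
toy <- data.table(
  x = c("A", "A", "B"),
  y = c("a", "a", "a"),
  z = c(10L, 15L, 50L)
)

# diff() of a single value is a zero-length vector.
length(diff(50L))

# Drop single-row groups: the right-hand side is skipped when .N is 1.
dropped <- toy[, if (.N > 1L && min(diff(z)) <= 11) .SD, by = .(x, y)]

# Keep single-row groups: the right-hand side is skipped when .N is 1.
kept <- toy[, if (.N == 1L || min(diff(z)) <= 11) .SD, by = .(x, y)]

nrow(dropped)
nrow(kept)
```

With this toy input, dropped has the two rows of (A, a), while kept also retains the lone row of (B, a).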

"I just wanted to know if there's a cleaner more intended way to do something like this in data.table"

I think a having= syntax would be convenient for this. It's currently a feature request.

Input data (since OP overwrites it):

library(data.table)
set.seed(123)
dt <- data.table(
  x = sample(LETTERS, 1000, TRUE),
  y = sample(letters, 1000, TRUE),
  z = sample(100, 1000, TRUE),
  key = tail(letters, 3)
)
dt <- unique(dt)
