R: fast (conditional) subsetting where feasible
I would like to subset rows of my data
library(data.table); set.seed(333); n <- 100
dat <- data.table(id=1:n, x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))
> head(dat)
id x y z
1: 1 109.3400 208.6732 308.7595
2: 2 101.6920 201.0989 310.1080
3: 3 119.4697 217.8550 313.9384
4: 4 111.4261 205.2945 317.3651
5: 5 100.4024 212.2826 305.1375
6: 6 114.4711 203.6988 319.4913
in several stages. I am aware that I could apply subset(.) sequentially to achieve this.
> s <- subset(dat, x>119)
> s <- subset(s, y>219)
> subset(s, z>315)
id x y z
1: 55 119.2634 219.0044 315.6556
My problem is that I need to automate this, and it might happen that a subset is empty. In this case, I would want to skip the step(s) that result in an empty set. For example, if my data were
dat2 <- dat[1:50]
> s <-subset(dat2,x>119)
> s
id x y z
1: 3 119.4697 217.8550 313.9384
2: 50 119.2519 214.2517 318.8567
the second step, subset(s, y>219), would come up empty, but I would still want to apply the third step, subset(s, z>315). Is there a way to apply a subset command only if it results in a non-empty set? I imagine something like subset(s, y>219, nonzero=TRUE). I would want to avoid constructions like
s <- dat
if(nrow(subset(s, x>119))>0){s <- subset(s, x>119)}
if(nrow(subset(s, y>219))>0){s <- subset(s, y>219)}
if(nrow(subset(s, z>315))>0){s <- subset(s, z>315)}
because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.). That's why I am hoping to find a solution optimized for speed.
PS. I only chose subset(.) for clarity; solutions using e.g. data.table would be just as welcome, if not more so.
I agree with Konrad's answer that this should throw a warning or at least report what happens somehow. Here's a data.table way that will take advantage of indices (see package vignettes for details):
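f = function(x, ..., verbose=FALSE){
  # capture the conditions unevaluated
  L = substitute(list(...))[-1]
  # log: one row per condition, flagged if the step is skipped
  mon = data.table(cond = as.character(L))[, skip := FALSE]
  for (i in seq_along(L)){
    # apply the i-th condition to the current table
    d = eval( substitute(x[cond, verbose=v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)){
      x = d
    } else {
      # empty result: keep x as it is and record the skip
      mon[i, skip := TRUE]
    }
  }
  print(mon)
  return(x)
}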
Usage
> f(dat, x > 119, y > 219, y > 1e6)
cond skip
1: x > 119 FALSE
2: y > 219 FALSE
3: y > 1e+06 TRUE
id x y z
1: 55 119.2634 219.0044 315.6556
The verbose option will print extra info provided by the data.table package, so you can see when indices are being used. For example, with f(dat, x == 119, verbose=TRUE), I see it.
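As a rough illustration of what "indices" refers to here, the sketch below builds a secondary index with setindex() and runs an equality query with verbose = TRUE. The table dt and its columns are made up for this example; the exact diagnostic text printed depends on the data.table version, so none is shown.

```r
library(data.table)

set.seed(1)
# Hypothetical example table (not the dat from the question)
dt <- data.table(id = 1:1000, x = sample(100:120, 1000, replace = TRUE))

# Build a secondary index on x; unlike setkey(), this does not reorder dt
setindex(dt, x)
indices(dt)   # lists the indexed columns

# With verbose = TRUE, data.table prints diagnostics for the query,
# which is where index use would show up
res <- dt[x == 119L, verbose = TRUE]
```

Equality conditions such as x == 119L are the kind of query an index can serve, which is presumably why the example in the answer uses == rather than >.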
"because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.)."
If it's for non-interactive use, it may be better to have the function return list(mon = mon, x = x) to more easily keep track of what the query was and what happened. Also, the verbose console output could be captured and returned.
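A minimal sketch of that non-interactive variant, assuming the same loop as f above. The name f_log and the small example tables are hypothetical; the verbose argument is left out to keep the sketch short.

```r
library(data.table)

# f_log() is a hypothetical variant of f(): same loop, but the log is
# returned alongside the data instead of being printed.
f_log <- function(x, ...) {
  L <- as.list(substitute(list(...)))[-1]   # unevaluated conditions
  conds <- vapply(L, function(e) paste(deparse(e), collapse = " "), character(1))
  mon <- data.table(cond = conds, skip = FALSE)
  for (i in seq_along(L)) {
    d <- eval(substitute(x[cond], list(cond = L[[i]])))
    if (nrow(d)) x <- d else mon[i, skip := TRUE]
  }
  list(mon = mon, x = x)
}

# Apply the same conditions to every table in a list
tables <- list(
  a = data.table(id = 1:4, x = c(1, 5, 6, 7), y = c(10, 20, 30, 40)),
  b = data.table(id = 1:3, x = c(1, 2, 3),    y = c(50, 60, 70))
)
results <- lapply(tables, function(d) f_log(d, x > 4, y > 25))
results$b$mon   # x > 4 is marked as skipped for table b
```

Each element of results then carries both the filtered table (x) and the record of which conditions were skipped (mon), so nothing needs to be read off the console.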
An interesting approach could be developed using a modified version of the filter function offered in dplyr. If the condition is not met by any row, the non_empty_filter function returns the original data set.
A warning reports when the filter has not been applied. Of course, this can be removed and has no bearing on the function results.
library(tidyverse)
library(rlang) # enquo
non_empty_filter <- function(df, expr) {
  expr <- enquo(expr)
  res <- df %>% filter(!!expr)
  if (nrow(res) > 0) {
    return(res)
  } else {
    # Indicate that filter is not applied
    warning("No rows meeting condition")
    return(df)
  }
}
Behaviour: Returning one row for which the condition is met.
dat %>%
non_empty_filter(x > 119 & y > 219)
# id x y z
# 1 55 119.2634 219.0044 315.6556
Behaviour: Returning the full data set, because the condition as a whole is not met due to y > 1e6.
dat %>%
non_empty_filter(x > 119 & y > 219 & y > 1e6)
# id x y z
# 1: 1 109.3400 208.6732 308.7595
# 2: 2 101.6920 201.0989 310.1080
# 3: 3 119.4697 217.8550 313.9384
# 4: 4 111.4261 205.2945 317.3651
# 5: 5 100.4024 212.2826 305.1375
# 6: 6 114.4711 203.6988 319.4913
# 7: 7 112.1879 209.5716 319.6732
# 8: 8 106.1344 202.2453 312.9427
# 9: 9 101.2702 210.5923 309.2864
# 10: 10 106.1071 211.8266 301.0645
Behaviour: Skipping a filter that would return an empty data set.
dat %>%
non_empty_filter(y > 1e6) %>%
non_empty_filter(x > 119) %>%
non_empty_filter(y > 219)
# id x y z
# 1 55 119.2634 219.0044 315.6556
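The chained calls above could also be folded into a single call that takes several conditions at once. The following is only a sketch under that assumption: non_empty_filter_all is a hypothetical name, it uses message() rather than warning() to report skipped conditions, and the small data frame is made up for the example.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: apply each condition in turn, keeping the
# previous result whenever a condition would leave no rows.
non_empty_filter_all <- function(df, ...) {
  conds <- enquos(...)              # capture all conditions unevaluated
  for (cond in conds) {
    res <- filter(df, !!cond)
    if (nrow(res) > 0) {
      df <- res
    } else {
      message("Skipped (no rows): ", as_label(cond))
    }
  }
  df
}

# Made-up example data
df <- data.frame(id = 1:4, x = c(1, 5, 6, 7), y = c(10, 20, 30, 40))
out <- non_empty_filter_all(df, y > 100, x > 4, y > 25)
```

Here y > 100 matches nothing and is skipped, while the other two conditions are applied in order.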