Fast ways to subset categorical data in R with multiple conditions
I have a large dataset in R (say >40,000 rows and >20 categorical columns) that I repeatedly subset, so I would like to speed this up as much as possible. It needs to be a general function (each categorical column has a discrete number of possible values, say in string format).
Each time I subset, I need to identify the rows that satisfy multiple logical set-membership conditions (e.g. >10 conditions). That is, I need to check several columns and test whether the values in each column belong to a given set (hence the use of %in%).
# simple dataset example
library(dplyr)
library(lazyeval)  # provides interp(), used by the dplyr filters below
num_col <- 15
num_row <- 100000
dat_list <- list()
for (i in 1:num_col) {
  dat_list[[i]] <- tibble(sample(letters[1:10], size = num_row, replace = TRUE))
}
dat <- bind_cols(dat_list)
names(dat) <- paste0("col", seq_len(num_col))
I've looked around the internet and SO a lot, but haven't found the discussion of performance I'm looking for. I mostly code using dplyr, so apologies if there's a clear performance improvement here in data.table; I've tried some simple benchmarks between the two (but without using any data.table indexing etc.) and it's not obvious that either one is faster.
Example options I've considered (since I'm not great at data.table, I've excluded data.table options from here):
base_filter <- function(dat) {
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dat <- dat[dat[[col_name]] %in% sample(letters[1:10], size = 4), ]
  }
  dat
}
dplyr_filter1 <- function(dat) {
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dat <- filter_(dat,
                   .dots = interp(~ colname %in% vals,
                                  colname = as.name(col_name),
                                  vals = sample(letters[1:10], size = 4)))
  }
  dat
}
dplyr_filter2 <- function(dat) {
  dots_filter <- list()
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dots_filter[[i]] <- interp(~ colname %in% vals,
                               colname = as.name(col_name),
                               vals = sample(letters[1:10], size = 4))
  }
  filter_(dat, .dots = dots_filter)
}
Note: In practice, on my real datasets, dplyr_filter2 actually works fastest. I've also tried dtplyr or converting my data to a data.table, but this seems slower than without.
Note: On the other hand, in practice, the base R function outperforms the dplyr examples when the data has fewer rows and fewer columns (perhaps due to copying speed?).
Thus, I'd like to ask SO what the general, most efficient way(s) to subset a categorical data frame under multiple (set membership) conditions are. If possible, please explain the mechanics behind the answer. Does this answer differ for smaller datasets? Does it depend on copying time or search time?
Useful related links
I understand that you prefer not to use data.table; I'm just providing some timings for reference below. With indexing (here, a keyed table), subsetting can be performed much faster, and an inner join of the two tables is also easy to do in data.table.
# simple dataset example
library(dplyr)
library(lazyeval)
set.seed(0L)
num_col <- 15
num_row <- 100000
dat_list <- list()
for (i in 1:num_col) {
  dat_list[[i]] <- tibble(sample(letters[1:10], size = num_row, replace = TRUE))
}
dat <- bind_cols(dat_list)
names(dat) <- paste0("col", seq_len(num_col))
# fix the selection sets up front so every method filters on the same values
selection <- lapply(1:7, function(n) sample(letters[1:10], size = 4))
base_filter <- function(df) {
  for (i in 1:7) {
    col_name <- paste0('col', i)
    df <- df[df[[col_name]] %in% selection[[i]], ]
  }
  df
}
dplyr_filter1 <- function(df) {
  for (i in 1:7) {
    col_name <- paste0('col', i)
    df <- filter_(df,
                  .dots = interp(~ colname %in% vals,
                                 colname = as.name(col_name),
                                 vals = selection[[i]]))
  }
  df
}
dplyr_filter2 <- function(df) {
  dots_filter <- list()
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dots_filter[[i]] <- interp(~ colname %in% vals,
                               colname = as.name(col_name),
                               vals = selection[[i]])
  }
  filter_(df, .dots = dots_filter)
}
library(data.table)
# convert data.frame into data.table, keyed on the 7 columns being filtered
dt <- data.table(dat, key = names(dat)[1:7])
# create the sets of selection: one row per allowed combination of values
dtSelection <- data.table(expand.grid(selection, stringsAsFactors = FALSE))
library(microbenchmark)
microbenchmark(
  base_filter(dat),
  dplyr_filter1(dat),
  dplyr_filter2(dat),
  dt[dtSelection, nomatch = 0],  # perform inner join between dataset and selection
  times = 5L)
# Unit: milliseconds
#                          expr       min        lq      mean    median        uq       max neval
#              base_filter(dat) 27.084801 27.870702 35.849261 32.045900 32.872601 59.372301     5
#            dplyr_filter1(dat) 23.130100 24.114301 26.922081 24.860701 29.804301 32.701002     5
#            dplyr_filter2(dat) 29.641101 30.686002 32.363681 31.103000 31.884701 38.503601     5
#  dt[dtSelection, nomatch = 0]  3.626001  3.646201  3.829341  3.686601  3.687001  4.500901     5
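To see why the join gives the same rows as the chained %in% filters, here is a minimal sketch on a toy table. The table toy, the list sel, and the column names a, b and id are made up for illustration. The idea is that expand.grid enumerates every allowed combination of values, so an inner join on those columns keeps exactly the rows whose values all fall in their sets. Keep in mind that the number of combinations grows multiplicatively with the number of conditions (4^7 = 16,384 rows for the selection in the benchmark above), so this approach may not scale to many large sets.

```r
library(data.table)

# Hypothetical toy data: two categorical columns plus a row id
toy <- data.table(
  a  = c("x", "y", "z", "x", "y"),
  b  = c("p", "q", "p", "q", "p"),
  id = 1:5
)
sel <- list(a = c("x", "y"), b = "p")  # allowed values per column

# Reference result: plain %in% filtering
by_filter <- toy[a %in% sel$a & b %in% sel$b]

# Join result: enumerate all allowed combinations, then inner join on a and b
combos  <- as.data.table(expand.grid(sel, stringsAsFactors = FALSE))
by_join <- toy[combos, on = c("a", "b"), nomatch = 0]

# Both approaches should select the same rows (row order may differ)
setequal(by_filter$id, by_join$id)
```

The sketch uses an ad hoc join via on = rather than a keyed table; setting a key, as in the benchmark, sorts the table once so that repeated joins can reuse that ordering.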
In addition to chinsoon12's alternatives, one thing to consider is to avoid subsetting the data.frame in each iteration, since every such subset copies the remaining rows of all columns. So, instead of
f0 = function(x, cond)
{
    for(j in seq_along(x)) x = x[x[[j]] %in% cond[[j]], ]
    return(x)
}
one alternative is to accumulate a logical vector of whether to include each row in the final subset:
f1 = function(x, cond)
{
    i = rep_len(TRUE, nrow(x))
    for(j in seq_along(x)) i = i & (x[[j]] %in% cond[[j]])
    return(x[i, ])
}
or, as another alternative, to iteratively reduce the number of comparisons, but by narrowing the vector of row indices instead of the data.frame itself:
f2 = function(x, cond)
{
    i = seq_len(nrow(x))  # safe for zero-row input, unlike 1:nrow(x)
    for(j in seq_along(x)) i = i[x[[j]][i] %in% cond[[j]]]
    return(x[i, ])
}
And a comparison with data:
set.seed(1821)
dat = as.data.frame(replicate(30, sample(c(letters, LETTERS), 5e5, TRUE), FALSE),
                    stringsAsFactors = FALSE)
conds = replicate(ncol(dat), sample(c(letters, LETTERS), 48), FALSE)
system.time({ ans0 = f0(dat, conds) })
#   user  system elapsed
#   3.44    0.28    3.86
system.time({ ans1 = f1(dat, conds) })
#   user  system elapsed
#   0.66    0.01    0.68
system.time({ ans2 = f2(dat, conds) })
#   user  system elapsed
#   0.34    0.01    0.39
identical(ans0, ans1)
#[1] TRUE
identical(ans1, ans2)
#[1] TRUE
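The functions above assume one condition per column, in column order. If only some columns carry a condition, a variant that takes a named list of allowed values might look like the sketch below. The name subset_in and its arguments are hypothetical, not part of any package; it reuses the idea of narrowing the row indices and subsetting the data.frame only once at the end.

```r
# Hypothetical helper: keep rows where every column named in `conds`
# has a value in the corresponding set. Row indices are narrowed step by
# step, and the data.frame itself is subset only once at the end.
subset_in <- function(x, conds) {
  stopifnot(is.list(conds), !is.null(names(conds)),
            all(names(conds) %in% names(x)))
  i <- seq_len(nrow(x))
  for (nm in names(conds)) {
    i <- i[x[[nm]][i] %in% conds[[nm]]]
  }
  x[i, , drop = FALSE]
}

# Usage on a small made-up data.frame
df <- data.frame(
  col1 = c("a", "b", "c", "a"),
  col2 = c("u", "v", "u", "v"),
  col3 = c("k", "k", "m", "m"),
  stringsAsFactors = FALSE
)
res <- subset_in(df, list(col1 = c("a", "b"), col3 = "k"))
res
```

Putting the most selective conditions first in the list should shrink the index vector fastest, so later columns are checked on fewer rows; whether that matters in practice is worth measuring on the real data.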