Fast ways to subset categorical data in R with multiple conditions

I have a large dataset in R (say >40,000 rows and >20 categorical columns) that I repeatedly subset, so I would like to speed this up as much as possible. The solution needs to be a general function (each categorical column has a discrete set of possible values, say in string format).

Each time I subset, I need to identify the subset of rows that satisfy multiple logical set-membership conditions (e.g. >10 conditions). That is, I need to check several columns and test whether the values in each column belong to a certain set (hence the use of %in%).

# simple dataset example: 15 columns, each a random sample of the letters a-j
library(dplyr)
num_col <- 15
num_row <- 100000
dat_list <- list()
for (i in 1:num_col) {
  dat_list[[i]] <- data_frame(sample(letters[1:10], size = num_row, replace = TRUE))
}
dat <- bind_cols(dat_list)
names(dat) <- paste0("col", seq(num_col))
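For a single column, one such set-membership condition looks like the following (a minimal illustration on the dat built above; the four target letters are arbitrary):

# rows of dat whose col1 value is one of four target letters
sub1 <- dat[dat$col1 %in% c("a", "b", "c", "d"), ]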

I've looked around the internet and SO a lot, but haven't found the discussion of performance I'm looking for. I mostly code with dplyr, so apologies if there's an obvious performance win in data.table; I've tried some simple benchmarks between the two (without using any data.table indexing etc.), and it isn't obvious whether one is faster.

Example options I've considered (since I'm not great at data.table, I've excluded data.table options here):

library(lazyeval)  # provides interp(), used in the dplyr options below

# option 1: repeatedly subset the data.frame itself, one condition at a time
base_filter <- function(dat) {
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dat <- dat[dat[[col_name]] %in% sample(letters[1:10], size = 4), ]
  }
  dat
}
# option 2: build each condition with interp() and apply filter_() per column
dplyr_filter1 <- function(dat) {
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dat <- filter_(dat,
                   .dots = interp(~ colname %in% vals,
                                  colname = as.name(col_name),
                                  vals = sample(letters[1:10], size = 4)))
  }
  dat
}
# option 3: build all conditions first, then apply them in a single filter_() call
dplyr_filter2 <- function(dat) {
  dots_filter <- list()
  for (i in 1:7) {
    col_name <- paste0('col', i)
    dots_filter[[i]] <- interp(~ colname %in% vals,
                               colname = as.name(col_name),
                               vals = sample(letters[1:10], size = 4))
  }
  filter_(dat, .dots = dots_filter)
}
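As an aside, filter_() and lazyeval::interp() were later deprecated; on dplyr >= 0.7 the same dynamic filter can be written with tidy evaluation. A sketch (dplyr_filter3 is just an illustrative name, not one of the benchmarked options):

library(rlang)  # for expr() and sym()

# build each condition as an expression like `col1 %in% c(...)`, then splice
# all of them into a single filter() call with !!!
dplyr_filter3 <- function(dat) {
  conds <- lapply(1:7, function(i) {
    expr(!!sym(paste0("col", i)) %in% !!sample(letters[1:10], size = 4))
  })
  dplyr::filter(dat, !!!conds)
}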

Note: In practice, on my real datasets, dplyr_filter2 is actually the fastest. I've also tried dtplyr and converting my data to a data.table, but both seem slower than plain dplyr.
Note: On the other hand, the base R function outperforms the dplyr versions when the data has fewer rows and fewer columns (perhaps due to copying overhead?).

Thus, I'd like to ask SO: what is the most efficient general way to subset a categorical data.frame under multiple set-membership conditions, and, if possible, why? Does the answer differ for smaller datasets? Does it depend on copying time or on search time?


Understand that you prefer not to use data.table; just providing some timings below for reference. With indexing (keys), subsetting can be performed much faster, and an inner join of the two tables is also easy to do in data.table.

# simple dataset example
library(dplyr)
library(lazyeval)
set.seed(0L)
num_col <- 15
num_row <- 100000
dat_list <- list()
for (i in 1:num_col) {
    dat_list[[i]] <- data_frame(sample(letters[1:10], size = num_row, replace = TRUE))
}
dat <- bind_cols(dat_list)
names(dat) <- paste0("col", seq(num_col))

# fix the 7 selection sets up front so every method filters on the same conditions
selection <- lapply(1:7, function(n) sample(letters[1:10], size = 4))

base_filter <- function(df) {
    for (i in 1:7) {
        col_name <- paste0('col', i)
        df <- df[df[[col_name]] %in% selection[[i]], ]
    }
    df
}

dplyr_filter1 <- function(df) {
    for (i in 1:7) {
        col_name <- paste0('col', i)
        df <- filter_(df,
            .dots = interp(~ colname %in% vals,
                colname = as.name(col_name),
                vals = selection[[i]]))
    }
    df
}

dplyr_filter2 <- function(df) {
    dots_filter <- list()
    for (i in 1:7) {
        col_name <- paste0('col', i)
        dots_filter[[i]] <- interp(~ colname %in% vals,
            colname = as.name(col_name),
            vals = selection[[i]])
    }
    filter_(df, .dots = dots_filter)
}


library(data.table)

# convert the data.frame into a data.table, keyed (i.e. indexed) on the first 7 columns
dt <- data.table(dat, key = names(dat)[1:7])

# all allowed combinations of the 7 selection sets (4 choices each, so 4^7 = 16384 rows)
dtSelection <- data.table(expand.grid(selection, stringsAsFactors = FALSE))


library(microbenchmark)
microbenchmark(
    base_filter(dat),
    dplyr_filter1(dat),
    dplyr_filter2(dat),
    dt[dtSelection, nomatch=0],   #perform inner join between dataset and selection
    times=5L)

#Unit: milliseconds
#                         expr       min        lq      mean    median        uq       max neval
#             base_filter(dat) 27.084801 27.870702 35.849261 32.045900 32.872601 59.372301     5
#           dplyr_filter1(dat) 23.130100 24.114301 26.922081 24.860701 29.804301 32.701002     5
#           dplyr_filter2(dat) 29.641101 30.686002 32.363681 31.103000 31.884701 38.503601     5
# dt[dtSelection, nomatch = 0]  3.626001  3.646201  3.829341  3.686601  3.687001  4.500901     5
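The speedup comes from the key: setting a key physically sorts the table, so the join locates matching rows by binary search on the sorted columns instead of scanning every row with %in%. A minimal illustration of keyed subsetting on the first key column (output omitted):

dt[.("a")]               # binary search on the key (col1)
dat[dat$col1 == "a", ]   # full vector scan, for comparison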

In addition to chinsoon12's alternatives, one thing to consider is avoiding subsetting the data.frame in each iteration. So, instead of

f0 = function(x, cond)
{
    # subset the full data.frame once per column
    for(j in seq_along(x)) x = x[x[[j]] %in% cond[[j]], ]
    return(x)
}

one alternative is to accumulate a logical vector marking whether each row belongs in the final subset:

f1 = function(x, cond)
{
    # AND together the per-column membership tests; subset only once at the end
    i = rep_len(TRUE, nrow(x))
    for(j in seq_along(x)) i = i & (x[[j]] %in% cond[[j]])
    return(x[i, ])
}
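The same accumulation can be written more compactly with Map and Reduce (an equivalent sketch, not included in the timings below; f1b is an illustrative name):

f1b = function(x, cond)
{
    # Map() gives one logical vector per column; Reduce() ANDs them together
    x[Reduce(`&`, Map(`%in%`, x, cond)), ]
}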

or, as another alternative, to iteratively reduce the number of comparisons, but by shrinking a vector of row indices instead of the data.frame itself:

f2 = function(x, cond)
{
    # shrink a vector of row indices; each later test scans fewer elements
    i = 1:nrow(x)
    for(j in seq_along(x)) i = i[x[[j]][i] %in% cond[[j]]]
    return(x[i, ])
}

And a comparison with data:

set.seed(1821)
dat = as.data.frame(replicate(30, sample(c(letters, LETTERS), 5e5, TRUE), FALSE), 
                    stringsAsFactors = FALSE)
conds = replicate(ncol(dat), sample(c(letters, LETTERS), 48), FALSE)

system.time({ ans0 = f0(dat, conds) })
#   user  system elapsed 
#   3.44    0.28    3.86 
system.time({ ans1 = f1(dat, conds) })
#   user  system elapsed 
#   0.66    0.01    0.68 
system.time({ ans2 = f2(dat, conds) })
#   user  system elapsed 
#   0.34    0.01    0.39

identical(ans0, ans1)
#[1] TRUE
identical(ans1, ans2)
#[1] TRUE
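As a further tweak (a sketch under the same inputs, not benchmarked above), f2 can exit early once no rows remain, which helps when an early condition already filters everything out:

f3 = function(x, cond)
{
    i = 1:nrow(x)
    for(j in seq_along(x)) {
        i = i[x[[j]][i] %in% cond[[j]]]
        if(!length(i)) break  # nothing left to match; skip the remaining columns
    }
    return(x[i, ])
}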
