How to divide a data frame into groups in R

I have a large data frame like this:

df:

col_1   col_2  col_3
  1       2       1

and I want to split it into these subgroups very quickly:

df_1:

 col_1   col_3
  1       1

df_2:

  col_2
    2

I know there is a way using which, something like this:

df_1 <- df[df == 1]
df_2 <- df[df == 2]

but it's not fast. What should I do?

Thanks

An option with dplyr and tidyr:

library(dplyr)

df %>% 
  tidyr::gather(key, val) %>% 
  group_split(val)   # credit to @agila for pointing out the unnecessary call to group_by that I missed initially
[[1]]
# A tibble: 2 x 2
  key     val
  <chr> <int>
1 col_1     1
2 col_3     1

[[2]]
# A tibble: 1 x 2
  key     val
  <chr> <int>
1 col_2     2

attr(,"ptype")
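tidyr::gather still works, but current tidyr marks it as superseded in favour of pivot_longer. A minimal sketch of the same split with pivot_longer, assuming the one-row df from the question (the names key and val are kept only to match the output above):

```r
library(dplyr)
library(tidyr)

# One-row data frame from the question
df <- data.frame(col_1 = 1, col_2 = 2, col_3 = 1)

groups <- df %>%
  # Reshape to long form: one row per (column name, value) pair
  pivot_longer(everything(), names_to = "key", values_to = "val") %>%
  # Split into a list of tibbles, one per distinct value
  group_split(val)

groups
```

Each list element holds the column names that share one value, so groups[[1]]$key lists the columns equal to the smallest value.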

Welcome to SO.

I'd suggest looking at the dplyr and data.table packages, which are focused on fast and memory-efficient implementations. In particular, I'd suggest the excellent answers to this question, which give a good understanding of what these two packages are capable of.

data.table does tend to outperform dplyr as the number of groups and repeated subsets grows, as it utilizes indexed and keyed subsets, but for most people it comes down to preference. Focusing on subsetting, I'll provide a reproducible example and some speed comparisons.

Reproducible example

set.seed(1)
df <- data.frame(group = sample(LETTERS, 1e7, TRUE), 
                 random_numbers = rnorm(1e7), 
                 random_binaries = rbinom(1e7, 1, 0.3))
# size = 152.6 MiB
format(object.size(df), units = "MiB")

Methods:

Base-R methods

In base R, subsetting can be performed in a myriad of ways. One is the one you have shown yourself, df[df == ..]. An alternative is the subset function; however, this is a utility function focused on readability rather than speed, and it will usually perform worse. One may also use the which function to convert a logical vector into indices, which may improve performance. Examples of their use are given below.

df[df$group == "C",]
#Equivalent
df[which(df$group == "C"),]
#Equivalent
subset(df, group == "C")
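The three forms above agree as long as the condition contains no missing values. A small sketch of how they differ when it does, using a made-up data frame (toy and its columns are hypothetical, not part of the benchmark data):

```r
# Hypothetical toy data with a missing group label
toy <- data.frame(group = c("C", NA, "A", "C"), x = 1:4)

# Logical indexing: an NA in the condition produces an all-NA row
by_logical <- toy[toy$group == "C", ]

# which() keeps only the positions that are TRUE, so the NA is dropped
by_which <- toy[which(toy$group == "C"), ]

# subset() also treats NA in the condition as FALSE
by_subset <- subset(toy, group == "C")

nrow(by_logical)
nrow(by_which)
nrow(by_subset)
```

So besides any speed difference, which and subset are the safer choices when the grouping column may contain NA.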

dplyr methods

An alternative is the dplyr package. dplyr is largely syntactic sugar, offering piping much like a few other packages (for example the magrittr package), but various benchmarks (shown in the first link) show that it can also improve performance in several respects. However, I am no expert on this package, as I tend to use data.table. The package provides the %>% pipe operator and utility functions such as filter, which can be used for subsetting data.

library(dplyr)
df %>% filter(group == "C")
# subsetting two columns
df %>% filter(group == "C", random_binaries == TRUE) #Equivalent to group == "C" & random_binaries == TRUE

data.table methods:

Lastly, another popular package is data.table. Like dplyr, it is designed for performance and memory efficiency. The syntax is designed to resemble SQL statements (select, from, where, group by), but it can be a bit confusing when starting out. The package provides a new data.table class, to be used instead of the data.frame class, which is notoriously slow for subsetting.

However, one can almost completely ignore the package's own syntax, as a data.table accepts data.frame syntax in most cases and can be used as a data.frame in nearly every circumstance.

library(data.table)
#Convert the data.frame to data.table
setDT(df) 

data.table has two standard methods: using indices and using keys. Indices are used when one subsets in a way similar to the data.frame methods:

df1 <- df[random_binaries == TRUE]
df2 <- df[group == "C"]

Indices have roughly the same speed as the other methods on first use, but performance improves on subsequent uses, because the index built the first time is reused.
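An index can also be built up front instead of waiting for the first subset. A minimal sketch with data.table's setindex and indices functions, on a small made-up table (dt and its columns are hypothetical):

```r
library(data.table)

# Small hypothetical table
dt <- data.table(group = c("A", "C", "C", "H"), value = 1:4)

# Build a secondary index on group; unlike a key, this does not reorder the rows
setindex(dt, group)

# List the indices currently stored on the table
indices(dt)

# An ordinary == subset on the indexed column
res <- dt[group == "C"]
res
```

The row order of dt is unchanged after setindex, which is the main practical difference from setting a key.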

Keys are used to pre-sort the data.table, which allows for smart subsetting. Setting the key does take some time, and the syntax is slightly different, but it outperforms the other methods (although indices are similar in speed).

# Set the key using either setkey (unquoted column names) or setkeyv (character vector of names)
setkeyv(df, c("group", "random_binaries"))
# Subset on group
df[.("C")]
# Subset on random_binaries; the column is integer (from rbinom), so join on 1L rather than TRUE
df[CJ(group, 1L, unique = TRUE)]
df[.(unique(group), 1L)]
# Subset on multiple conditions
df[.(c("C", "H"), c(1L, 1L))]
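To make the keyed syntax less opaque, here is a small sketch on made-up data (dt is hypothetical) showing that a keyed lookup returns the same rows as the equivalent logical condition:

```r
library(data.table)

# Small hypothetical table; random_binaries is stored as integer
dt <- data.table(group = c("H", "C", "A", "C"),
                 random_binaries = c(1L, 0L, 1L, 1L))

# Setting the key sorts the table by group, then random_binaries
setkeyv(dt, c("group", "random_binaries"))
key(dt)

# Keyed lookup on the first key column only
only_c <- dt[.("C")]

# Keyed lookup on both key columns
keyed <- dt[.("C", 1L)]

# The same rows selected with an ordinary logical condition
scanned <- dt[group == "C" & random_binaries == 1L]

nrow(only_c)
nrow(keyed)
```

The values inside .() are matched against the key columns in order, which is why the first lookup needs no column name.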

The syntax may be confusing, but one can check out their useful wiki page, or the many Stack Overflow posts (8968 as of today), which provide answers to most questions.

Performance comparison

I've checked the performance of the subsetting methods presented, which is visualized below. The visualization shows the run times for subsetting on group == "C" and on group == "H" & random_binaries == TRUE using the methods illustrated. The x-axis indicates the run time in milliseconds, and the y-axis shows the methods. The width of each blob indicates the range, while the size of the blob illustrates the density of times within that range.

From the visualization one can see that for this dataset, when subsetting on either one or two columns, the data.table methods using keys are much faster (marked as data.table_.._keyed), while using indices slightly outperforms the remaining methods. Using subset is slower than the standard methods, and surprisingly, for this illustration dplyr is slower than base R; however, this might be due to my inexperience with the package.

[Image: run times of each subsetting method]
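The code behind the benchmark is not shown here. As a rough sketch only, a comparison of the base-R methods could be timed with system.time, assuming a smaller version of the example data (n = 1e5 and the name small are my own choices, not the original setup):

```r
set.seed(1)
n <- 1e5  # far fewer rows than the 1e7 used above, so this runs quickly
small <- data.frame(group = sample(LETTERS, n, TRUE),
                    random_numbers = rnorm(n),
                    random_binaries = rbinom(n, 1, 0.3))

# Time each base-R method once, keeping the result of each subset
t_logical <- system.time(r1 <- small[small$group == "C", ])
t_which   <- system.time(r2 <- small[which(small$group == "C"), ])
t_subset  <- system.time(r3 <- subset(small, group == "C"))

# Elapsed wall-clock time in seconds for each method
c(logical = t_logical[["elapsed"]],
  which   = t_which[["elapsed"]],
  subset  = t_subset[["elapsed"]])
```

A single system.time call is noisy, so for anything beyond a quick check each method should be repeated many times, which is what produces a distribution like the one in the plot.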

Here's one way using lapply from base R, which gives you a list of your desired data frames:

df <- data.frame(col_1 = 1, col_2 = 2, col_3 = 1)

lapply(unique(unlist(df)), function(x) {
  df[, df == x, drop = FALSE]
})

# output

[[1]]
  col_1 col_3
1     1     1

[[2]]
  col_2
1     2
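A related base-R idiom, offered as a sketch rather than part of the answer above: split.default treats the data frame as a list of columns, so it can group the columns by their values in one call. This assumes a one-row data frame as in the question.

```r
df <- data.frame(col_1 = 1, col_2 = 2, col_3 = 1)

# unlist(df) gives one value per column; columns with equal values land in the same group
groups <- split.default(df, unlist(df))

names(groups)
groups[["1"]]
groups[["2"]]
```

The list elements are named after the values, so groups[["1"]] holds col_1 and col_3, and groups[["2"]] holds col_2.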
