Group by multiple columns in dplyr, using string vector input

I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.

# make data with weird column names that can't be hard coded
data = data.frame(
  asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
  a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

# plyr - works
ddply(data, columns, summarize, value=mean(value))

# dplyr - raises error
data %.%
  group_by(columns) %.%
  summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds

What am I missing to translate the plyr example into a dplyr-esque syntax?

Edit 2017: dplyr has been updated, so a simpler solution is available. See the currently selected answer.

To write the code out in full, here's an update on Hadley's answer with the new syntax:

library(dplyr)

df <-  data.frame(
    asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
    a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
    value = rnorm(100)
)

# Columns you want to group by
grp_cols <- names(df)[-3]

# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)

# Perform frequency counts
df %>%
    group_by_(.dots=dots) %>%
    summarise(n = n())

Output:

Source: local data frame [9 x 3]
Groups: asihckhdoydk

  asihckhdoydk a30mvxigxkgh  n
1            A            A 10
2            A            B 10
3            A            C 13
4            B            A 14
5            B            B 10
6            B            C 12
7            C            A  9
8            C            B 12
9            C            C 10

The support for this in dplyr is currently pretty weak; eventually I think the syntax will be something like:

df %.% group_by(.groups = c("asdfgfTgdsx", "asdfk30v0ja"))

But that probably won't be there for a while (because I need to think through all the consequences).

In the meantime, you can use regroup(), which takes a list of symbols:

library(dplyr)

df <-  data.frame(
  asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
  a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

df %.%
  regroup(list(quote(asihckhdoydk), quote(a30mvxigxkgh))) %.%
  summarise(n = n())

If you have a character vector of column names, you can convert them to the right structure with lapply() and as.symbol():

vars <- setdiff(names(df), "value")
vars2 <- lapply(vars, as.symbol)

df %.% regroup(vars2) %.% summarise(n = n())
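
To see what that conversion produces, here is a small base-R sketch. The column names are reused from the example above purely for illustration. Each string becomes a symbol, the same kind of object that quote() returns for a bare name:

```r
# Column names used only for illustration
vars <- c("asihckhdoydk", "a30mvxigxkgh")

# as.symbol() turns each string into an unevaluated name
vars2 <- lapply(vars, as.symbol)

# Every element is a symbol, identical to what quote() gives for a bare name
all(vapply(vars2, is.symbol, logical(1)))
identical(vars2[[1]], quote(asihckhdoydk))
```

This is why the list built with lapply() can stand in for the hand-written list(quote(...), quote(...)) shown earlier.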

Since this question was posted, dplyr has added scoped versions of group_by (documentation here). This lets you use the same helper functions you would use with select, like so:

data = data.frame(
    asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
    a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
    value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

library(dplyr)
df1 <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value))

#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE 
##  27 

The output from your example question is as expected (see the comparison to plyr above and the output below):

# A tibble: 9 x 3
# Groups:   asihckhdoydkhxiydfgfTgdsx [?]
  asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja       Value
                     <fctr>                    <fctr>       <dbl>
1                         A                         A  0.04095002
2                         A                         B  0.24943935
3                         A                         C -0.25783892
4                         B                         A  0.15161805
5                         B                         B  0.27189974
6                         B                         C  0.20858897
7                         C                         A  0.19502221
8                         C                         B  0.56837548
9                         C                         C -0.22682998

Note that since dplyr::summarize only strips off one layer of grouping at a time, you've still got some grouping going on in the resultant tibble (which can sometimes catch people by surprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
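
The leftover grouping can be inspected with group_vars(). The following is a minimal sketch, assuming dplyr 1.0.0 or later (for across(), all_of() and the .groups argument) and using hypothetical column names g1 and g2:

```r
library(dplyr)

set.seed(1)
# Hypothetical data: two grouping columns and one numeric column
df <- data.frame(
  g1 = sample(LETTERS[1:3], 100, replace = TRUE),
  g2 = sample(LETTERS[1:3], 100, replace = TRUE),
  value = rnorm(100)
)
cols <- c("g1", "g2")

# "drop_last" spells out the behavior described above:
# only the last grouping layer (g2) is removed
res <- df %>%
  group_by(across(all_of(cols))) %>%
  summarise(Value = mean(value), .groups = "drop_last")

group_vars(res)       # g1 is still a grouping variable

# ungroup() removes whatever grouping is left
res_flat <- ungroup(res)
group_vars(res_flat)  # no grouping variables remain
```

Passing .groups = "drop" to summarise() should give the same ungrouped result in one step.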

String specification of columns in dplyr is now supported through variants of the dplyr functions whose names end in an underscore. For example, corresponding to the group_by function there is a group_by_ function that may take string arguments. This vignette describes the syntax of these functions in detail.

The following snippet cleanly solves the problem that @sharoz originally posed (note the need to write out the .dots argument):

# Given data and columns from the OP

data %>%
    group_by_(.dots = columns) %>%
    summarise(Value = mean(value))

(Note that dplyr now uses the %>% operator, and %.% is deprecated.)
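
The underscore verbs have since been deprecated in favor of tidy evaluation. If the grouping needs to live inside a reusable function, a sketch along these lines may help. It assumes dplyr 1.0.0 or later, and group_mean is a hypothetical helper name, not part of dplyr:

```r
library(dplyr)

# Hypothetical helper: group `data` by the columns named in `group_cols`
# and average the column named in `value_col` (both given as strings)
group_mean <- function(data, group_cols, value_col) {
  data %>%
    group_by(across(all_of(group_cols))) %>%
    summarise(Value = mean(.data[[value_col]]), .groups = "drop")
}

set.seed(1)
data <- data.frame(
  asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace = TRUE),
  a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace = TRUE),
  value = rnorm(100)
)
columns <- names(data)[-3]

result <- group_mean(data, columns, "value")
```

The .data[[value_col]] form looks the value column up by its string name, so neither the grouping columns nor the summarised column has to be hard coded.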

Until dplyr has full support for string arguments, perhaps this gist is useful:

https://gist.github.com/skranz/9681509

It contains a bunch of wrapper functions like s_group_by, s_mutate, s_filter, etc. that use string arguments. You can mix them with the normal dplyr functions. For example:

cols = c("cyl","gear")
mtcars %.%
  s_group_by(cols) %.%  
  s_summarise("avdisp=mean(disp), max(disp)") %.%
  arrange(avdisp)

It works if you pass it the objects (well, you aren't, but...) rather than as a character vector:

df %.%
    group_by(asdfgfTgdsx, asdfk30v0ja) %.%
    summarise(Value = mean(value))

> df %.%
+   group_by(asdfgfTgdsx, asdfk30v0ja) %.%
+   summarise(Value = mean(value))
Source: local data frame [9 x 3]
Groups: asdfgfTgdsx

  asdfgfTgdsx asdfk30v0ja        Value
1           A           C  0.046538002
2           C           B -0.286359899
3           B           A -0.305159419
4           C           A -0.004741504
5           B           B  0.520126476
6           C           C  0.086805492
7           B           C -0.052613078
8           A           A  0.368410146
9           A           B  0.088462212

where df was your data.

?group_by says:

 ...: variables to group by. All tbls accept variable names, some
      will also accept functons of variables. Duplicated groups
      will be silently dropped.

which I interpret to mean not the character versions of the names, but how you would refer to them in foo$bar; bar is not quoted here. Or how you'd refer to variables in a formula: foo ~ bar.

@Arun also mentions that you can do:

df %.%
    group_by("asdfgfTgdsx", "asdfk30v0ja") %.%
    summarise(Value = mean(value))

But you can't pass in something that, left unevaluated, is not the name of a variable in the data object.

I presume this is due to the internal methods Hadley is using to look up the things you pass in via the ... argument.
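
Behavior here has changed across versions. In recent dplyr releases (assumed here: 1.0.0 or later), a quoted name passed directly to group_by() is evaluated as a constant value rather than looked up as a column, so the call runs but produces a single group. A small sketch with the built-in iris data:

```r
library(dplyr)

# A string literal is evaluated as a constant, so every row lands in one group
by_string <- iris %>%
  group_by("Species") %>%
  summarise(n = n())

# A bare name refers to the column and yields one row per species
by_name <- iris %>%
  group_by(Species) %>%
  summarise(n = n())

nrow(by_string)
nrow(by_name)
```

Comparing the two row counts is a quick way to check which behavior an installed version has.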

Update with across() from dplyr 1.0.0

All the answers above are still working, and the solutions with the .dots argument are intriguing.

BUT if you are looking for a solution that is easier to remember, the new across() comes in handy. It was published 2020-04-03 by Hadley Wickham, can be used in mutate() and summarise(), and replaces the scoped variants like _at or _all. Above all, it very elegantly replaces the cumbersome non-standard evaluation (NSE) with quoting/unquoting such as !!! rlang::syms().
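
For comparison, the quoting/unquoting style mentioned above looks roughly like this. It is a sketch with made-up column names my.a and my.b, and it assumes the rlang package (a dependency of dplyr) is installed:

```r
library(dplyr)

set.seed(1)
# Hypothetical data with two grouping columns
data <- data.frame(
  my.a = sample(LETTERS[1:3], 100, replace = TRUE),
  my.b = sample(LETTERS[1:3], 100, replace = TRUE),
  value = rnorm(100)
)
columns <- c("my.a", "my.b")

# syms() converts the strings to symbols; !!! splices them into group_by()
by_syms <- data %>%
  group_by(!!!rlang::syms(columns)) %>%
  summarise(Value = mean(value), .groups = "drop")
```

It works, but the reader has to know what both syms() and !!! do, which is the overhead that across() removes.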

So the solution with across looks very readable:

data %>%
  group_by(across(all_of(columns))) %>%
  summarize(Value = mean(value))

data = data.frame(
  my.a = sample(LETTERS[1:3], 100, replace=TRUE),
  my.b = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

group_by(data,newcol=paste(my.a,my.b,sep="_")) %>% summarise(Value=mean(value))

One (tiny) case that is missing from the answers here, and that I wanted to make explicit, is when the variables to group by are generated dynamically midstream in a pipeline:

library(wakefield)
df_foo = r_series(rnorm, 10, 1000)
df_foo %>% 
  # 1. create quantized versions of base variables
  mutate_each(
    funs(Quantized = . > 0)
  ) %>% 
  # 2. group_by the indicator variables
  group_by_(
    .dots = grep("Quantized", names(.), value = TRUE)
    ) %>% 
  # 3. summarize the base variables
  summarize_each(
    funs(sum(., na.rm = TRUE)), contains("X_")
  )

This basically shows how to use grep in conjunction with group_by_(.dots = ...) to achieve this.
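
mutate_each(), summarize_each() and funs() have since been deprecated. A roughly equivalent sketch with across() follows, assuming dplyr 1.0.0 or later. A plain data frame with hypothetical columns X_1 and X_2 stands in for the wakefield data, so this is an approximation of the pipeline above rather than a drop-in replacement:

```r
library(dplyr)

set.seed(42)
# Hypothetical stand-in for the generated series
df_foo <- data.frame(X_1 = rnorm(10), X_2 = rnorm(10))

out <- df_foo %>%
  # 1. create indicator versions of the base variables
  mutate(across(starts_with("X_"), ~ .x > 0, .names = "{.col}_Quantized")) %>%
  # 2. group by the indicator columns, selected by name pattern midstream
  group_by(across(ends_with("_Quantized"))) %>%
  # 3. sum the base variables; across() skips the grouping columns
  summarise(across(starts_with("X_"), ~ sum(.x, na.rm = TRUE)), .groups = "drop")
```

The tidyselect helper ends_with() plays the role that grep() over names(.) plays in the original.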

General example of using the .dots argument as character vector input to the dplyr::group_by function:

iris %>% 
    group_by(.dots ="Species") %>% 
    summarise(meanpetallength = mean(Petal.Length))

Or without a hard-coded name for the grouping variable (as asked by the OP):

iris %>% 
    group_by(.dots = names(iris)[5]) %>% 
    summarise_at("Petal.Length", mean)

With the example of the OP:

data %>% 
    group_by(.dots =names(data)[-3]) %>% 
    summarise_at("value", mean)

See also the dplyr vignette on programming, which explains pronouns, quasiquotation, quosures, and tidyeval.
