How to group_by without creating a grouping variable?

I need to perform a basic group_by / mutate operation using an auxiliary grouping variable. For instance:

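library(dplyr)

df <- data.frame(
  u = c(0, 0, 1, 0, 1),
  v = c(8, 4, 2, 3, 5)
)

df %>%
  group_by(tmp = cumsum(u)) %>%  # temporary grouping column
  mutate(w = cumprod(v)) %>%     # cumulative product within each group
  ungroup() %>%
  select(-tmp)                   # drop the temporary column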

My problem is that if df already contains a column named tmp, I will lose it.

Of course, I could choose a very exotic name instead of tmp to reduce the likelihood of a collision (or I could even choose something like strrep("z", max(nchar(names(df))) + 1) to be sure), but I'd prefer a cleaner solution.

In other words, I'm looking for the dplyr equivalent of this data.table line:

setDT(df)[, w := cumprod(v), by = cumsum(u)]

We could create a function to take care of this. Assume the temporary grouping variable is to be called 'tmp'. Appending that name to the column names of the dataset and calling make.unique guarantees a non-clashing result: if the dataset already has a 'tmp' column, the new name becomes 'tmp.1'. Unquoting nm1 with !! on the left-hand side of := then names the grouping column 'tmp.1', so the 'tmp' column already present in the dataset is not affected. If there is no 'tmp' column, the grouping column is simply named 'tmp'. In either case it is removed afterwards with select.

f1 <- function(dat, grpCol, Col) {
  # capture the unquoted column arguments
  grpCol <- enquo(grpCol)
  Col <- enquo(Col)

  changeCol <- "tmp"
  # pick a grouping name that does not clash with existing columns
  nm1 <- tail(make.unique(c(names(dat), changeCol)), 1)

  dat %>%
    group_by(!!nm1 := cumsum(!!grpCol)) %>%
    mutate(w = cumprod(!!Col)) %>%
    ungroup() %>%
    select(-one_of(nm1))
}
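
The name-picking step can be checked on its own, without dplyr. The sketch below uses two made-up vectors of column names (with_tmp and without_tmp are illustrative, not from the question) to show what tail(make.unique(...), 1) returns in each case:

```r
# Hypothetical column-name vectors, used only for illustration
with_tmp    <- c("u", "v", "tmp")  # dataset that already has a 'tmp' column
without_tmp <- c("u", "v")         # dataset without one

# Append the candidate name and keep the last (de-duplicated) element
nm_clash    <- tail(make.unique(c(with_tmp, "tmp")), 1)
nm_no_clash <- tail(make.unique(c(without_tmp, "tmp")), 1)

nm_clash     # "tmp.1"
nm_no_clash  # "tmp"
```

Because make.unique only ever alters the later duplicates, the existing column keeps its name and only the temporary one is renamed.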

Run the function:

f1(df, u, v)
# A tibble: 5 x 3
#      u     v     w
#  <dbl> <dbl> <dbl>
#1  0     8.00  8.00
#2  0     4.00 32.0 
#3  1.00  2.00  2.00
#4  0     3.00  6.00
#5  1.00  5.00  5.00


f1(df %>% mutate(tmp = 1), u, v)  # dataset that already has a 'tmp' column
# A tibble: 5 x 4
#      u     v   tmp     w
#  <dbl> <dbl> <dbl> <dbl>
#1  0     8.00  1.00  8.00
#2  0     4.00  1.00 32.0 
#3  1.00  2.00  1.00  2.00
#4  0     3.00  1.00  6.00
#5  1.00  5.00  1.00  5.00

As a follow-up (comments from @Frank) about passing expressions:

expr <- quos(tmp = cumsum(u), w = cumprod(v))
# additional check outside the function: rename the grouping
# expression if its name clashes with an existing column
names(expr)[1] <- if (names(expr)[1] %in% names(df))
  strrep(names(expr)[1], 2) else names(expr)[1]

f2 <- function(dat, exprs) {
  dat %>%
    group_by(!!!exprs[1]) %>%
    mutate(!!!exprs[2])
}

f2(df, expr)
# A tibble: 5 x 4
# Groups: tmp [3]
#      u     v   tmp     w
#  <dbl> <dbl> <dbl> <dbl> 
#1  0     8.00  0     8.00
#2  0     4.00  0    32.0 
#3  1.00  2.00  1.00  2.00
#4  0     3.00  1.00  6.00
#5  1.00  5.00  2.00  5.00
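
Note that f2 returns a grouped tibble that still contains the grouping column, and that doubling the name with strrep could in principle still collide with another column. A possible variant, sketched here under the assumption of dplyr 1.0.0 or later (for all_of), combines the make.unique idea from f1 with the expression-passing interface. The name f3 is hypothetical and not part of the original answer:

```r
library(dplyr)

df <- data.frame(u = c(0, 0, 1, 0, 1), v = c(8, 4, 2, 3, 5))
expr <- quos(tmp = cumsum(u), w = cumprod(v))

# Hypothetical helper: exprs[1] is the grouping expression,
# exprs[2] is the expression to compute within each group.
f3 <- function(dat, exprs) {
  # choose a grouping name that cannot clash with an existing column
  grp_name <- tail(make.unique(c(names(dat), names(exprs)[1])), 1)
  names(exprs)[1] <- grp_name

  dat %>%
    group_by(!!!exprs[1]) %>%
    mutate(!!!exprs[2]) %>%
    ungroup() %>%
    select(-all_of(grp_name))  # drop only the temporary column
}

# The existing 'tmp' column survives; only the temporary one is dropped
res <- f3(df %>% mutate(tmp = 1), expr)
```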

You could use ave instead:

df %>% mutate(w = ave(v, cumsum(u), FUN = cumprod))
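
To see why this needs no grouping column at all, here is the same computation in plain base R on the question's data. The intermediate vector grp exists only for illustration and never becomes part of df:

```r
df <- data.frame(
  u = c(0, 0, 1, 0, 1),
  v = c(8, 4, 2, 3, 5)
)

# Group ids live in a free-standing vector, not in a column of df
grp <- cumsum(df$u)                  # 0 0 1 1 2

# ave applies FUN within each group and returns values in the original row order
w <- ave(df$v, grp, FUN = cumprod)   # 8 32 2 6 5
```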

by would also work:

df %>% 
   by(cumsum(.$u), mutate, w = cumprod(v)) %>% 
   unclass %>% 
   bind_rows

