Is there a way to create new columns with dplyr by dividing a column by its per-group mode, grouped by another column?

Question

I am trying to create a new column by dividing an integer column A (data1, data2, data3 below) by the mode of column A within each group defined by another integer column B (group1, group2 below).

group1=rep(1:5,each=2)
group2=rep(6:10, each=2)
data1=c(1,1,1,1,1,4,5,6,3,8)
data2=c(5,4,5,7,8,5,2,1,1,5)
data3=c(6,6,8,9,5,4,3,3,1,1)
DF=data.frame(group1,group2,data1,data2,data3)

   group1 group2 data1 data2 data3
1       1      6     1     5     6
2       1      6     1     4     6
3       2      7     1     5     8
4       2      7     1     7     9
5       3      8     1     8     5
6       3      8     4     5     4
7       4      9     5     2     3
8       4      9     6     1     3
9       5     10     3     1     1
10      5     10     8     5     1

I have been able to do this one column at a time (see the code below), but I would like to generalize it:

library(dplyr)

DF %>%
  group_by(group2) %>%
  mutate(group2_mode = as.integer(head(names(sort(table(data2))), 1))) %>%
  mutate(group2_data2 = data2 / group2_mode) %>%
  ungroup()

# A tibble: 10 x 7
   group1 group2 data1 data2 data3 group2_mode group2_data2
    <int>  <int> <dbl> <dbl> <dbl>       <int>        <dbl>
 1      1      6     1     5     6           4         1.25
 2      1      6     1     4     6           4         1   
 3      2      7     1     5     8           5         1   
 4      2      7     1     7     9           5         1.4 
 5      3      8     1     8     5           5         1.6 
 6      3      8     4     5     4           5         1   
 7      4      9     5     2     3           1         2   
 8      4      9     6     1     3           1         1   
 9      5     10     3     1     1           1         1   
10      5     10     8     5     1           1         5

This works, but it is clunky to write out for every data/group combination.

I have tried iterating with nested for loops as follows:

for (i in colnames(DF[,3:5])){
  for (k in colnames(DF[,1:2])){
    DF %>%
      group_by(k) %>%
      mutate(paste(c(k,"_",i), collapse = '') <- i/as.integer(head(names(sort(table(i))),1)))
  }
}

And I receive the following error:

Error: Column `k` is unknown

I expect the output to be similar to the first code chunk above, but for each data/group combination. I have also tried giving all of the mutated columns in the for loop the same name, but that results in the same error. I suspect the issue lies in the group_by statement, but I can't figure out what it is.

Thank you for your time.

Answer 1

Borrowing from here, we can define a helper mode function:

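# Most frequent value of a vector of positive integers (ties go to the smallest value).
# Note: this masks base::mode() for the rest of the session.
mode <- function(codes){
  which.max(tabulate(codes))
}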

Then:

DF %>%
  group_by(group2) %>%
  mutate_at(vars(matches("data")), ~. / mode(.))

[In theory this should work, but this mode function seems to behave differently from yours, and I don't yet see how to reconcile the two.]
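To make the difference between the two mode calculations concrete, here is a minimal sketch that runs both on a small vector. The function names asker_mode, tabulate_mode and asker_mode_fixed are hypothetical labels introduced only for this comparison; the bodies of the first two are the expressions from the question and from this answer.

```r
# Expression from the question: sort() orders the counts ascending,
# so head(..., 1) picks the value with the smallest count.
asker_mode <- function(x) as.integer(head(names(sort(table(x))), 1))

# Helper from this answer, renamed here to avoid masking base::mode().
tabulate_mode <- function(codes) which.max(tabulate(codes))

# Possible variant of the question's expression (an assumption about the
# intent): sort the counts in decreasing order so the first name is the
# most frequent value.
asker_mode_fixed <- function(x) {
  as.integer(names(sort(table(x), decreasing = TRUE))[1])
}

x <- c(5, 5, 7)
asker_mode(x)        # 7, the least frequent value
tabulate_mode(x)     # 5, the most frequent value
asker_mode_fixed(x)  # 5
```

On the example DF every group has only two rows, so each group is either a tie or a repeated value, and the two definitions happen to return the same number there. They diverge as soon as a group has a clear majority value, as in the vector above.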

Edit: To do this for a few grouping columns, you could create new columns like so:

DF %>%
  group_by(group1) %>%
  mutate_at(vars(matches("^data[0-9]+$")),
            .funs = list(gp1 = ~. / mode(.))) %>%
  group_by(group2) %>%
  # Anchoring the pattern keeps this step from also matching data1_gp1 etc.
  mutate_at(vars(matches("^data[0-9]+$")),
            .funs = list(gp2 = ~. / mode(.)))

# A tibble: 10 x 11
# Groups:   group2 [5]
   group1 group2 data1 data2 data3 data1_gp1 data2_gp1 data3_gp1 data1_gp2 data2_gp2 data3_gp2
    <int>  <int> <dbl> <dbl> <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1      1      6     1     5     6      1         1.25      1         1         1.25      1   
 2      1      6     1     4     6      1         1         1         1         1         1   
 3      2      7     1     5     8      1         1         1         1         1         1   
 4      2      7     1     7     9      1         1.4       1.12      1         1.4       1.12
 5      3      8     1     8     5      1         1.6       1.25      1         1.6       1.25
 6      3      8     4     5     4      4         1         1         4         1         1   
 7      4      9     5     2     3      1         2         1         1         2         1   
 8      4      9     6     1     3      1.2       1         1         1.2       1         1   
 9      5     10     3     1     1      1         1         1         1         1         1   
10      5     10     8     5     1      2.67      5         1         2.67      5         1   

If you have many grouping columns, it may be worth wrapping this in a function. The one below mostly works, except for the naming step: I want the group argument to also supply the suffix for the new column names. := did not seem to work for me here, even though it is otherwise the way to name new columns in tidyeval. Can someone help me here?

add_grouped_medians <- function(df, group) {
  suffix = !!group  # This part seems to be missing the right
                    #  syntax. I want to make the group input available to the
                    #  .funs list below....
  df %>%
    group_by(!! group) %>%
    mutate_at(vars(matches("data")),
              .funs = list( suffix = ~. / mode(.)))
}

Note how the output uses "suffix" literally instead of substituting the group name:

> DF %>% add_grouped_medians(group1)
# A tibble: 10 x 9
# Groups:   <int> [5]
   group1 group2 data1 data2 data3 `<int>` data1_suffix data2_suffix data3_suffix
    <int>  <int> <dbl> <dbl> <dbl>   <int>        <dbl>        <dbl>        <dbl>
 1      1      6     1     5     6       1         1            1.25         1   
 2      1      6     1     4     6       1         1            1            1   
 3      2      7     1     5     8       2         1            1            1   
 4      2      7     1     7     9       2         1            1.4          1.12
 5      3      8     1     8     5       3         1            1.6          1.25
 6      3      8     4     5     4       3         4            1            1   
 7      4      9     5     2     3       4         1            2            1   
 8      4      9     6     1     3       4         1.2          1            1   
 9      5     10     3     1     1       5         1            1            1   
10      5     10     8     5     1       5         2.67         5            1
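One possible way to get the group name into the new column labels is sketched below. It assumes dplyr 1.0 or later (for across() and its .names argument) and rlang 0.4 or later (for the {{ }} operator). The names int_mode and add_grouped_modes are hypothetical and are not part of the original answer. The grouping column is passed unquoted, turned into a string with rlang::as_name(rlang::enquo(group)), and that string is pasted into the .names template.

```r
library(dplyr)

# Example data from the question.
DF <- data.frame(
  group1 = rep(1:5, each = 2),
  group2 = rep(6:10, each = 2),
  data1 = c(1, 1, 1, 1, 1, 4, 5, 6, 3, 8),
  data2 = c(5, 4, 5, 7, 8, 5, 2, 1, 1, 5),
  data3 = c(6, 6, 8, 9, 5, 4, 3, 3, 1, 1)
)

# Same calculation as the mode() helper above, under a hypothetical name
# that does not mask base::mode(). Assumes positive integer values.
int_mode <- function(codes) which.max(tabulate(codes))

# Hypothetical helper: divide every dataN column by its per-group mode and
# name the result <column>_<grouping column>.
add_grouped_modes <- function(df, group) {
  # Capture the unquoted column name as a string, e.g. "group1".
  suffix <- rlang::as_name(rlang::enquo(group))
  df %>%
    group_by({{ group }}) %>%
    mutate(across(matches("^data[0-9]+$"),
                  ~ .x / int_mode(.x),
                  .names = paste0("{.col}_", suffix))) %>%
    ungroup()
}

res <- DF %>%
  add_grouped_modes(group1) %>%
  add_grouped_modes(group2)
names(res)
```

The anchored pattern "^data[0-9]+$" is a deliberate choice: it keeps the second call from picking up the columns created by the first one, so the result has only the three original data columns plus six new ones.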

Answer 2

You could try some tidy evaluation. The definition of Mode is taken from here.

Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

We can use grep to separate the group and data columns, then loop over them with a for loop:

library(dplyr)
library(rlang)

group_cols <- grep("^group", names(DF), value = TRUE)
data_cols <- grep("^data", names(DF), value = TRUE)

for (col  in seq_along(group_cols)) {
    data  <- sym(data_cols[col])
    DF <- DF %>%
           group_by_at(group_cols[col]) %>%
           mutate(!!paste0("group", col, "mode") := !!data/Mode(!!data))
}
DF

#   group1 group2 data1 data2 data3 group1mode group2mode
#    <int>  <int> <dbl> <dbl> <dbl>      <dbl>      <dbl>
# 1      1      6     1     5     6       1         1    
# 2      1      6     1     4     6       1         0.8  
# 3      2      7     1     5     8       1         1    
# 4      2      7     1     7     9       1         1.4  
# 5      3      8     1     8     5       1         1    
# 6      3      8     4     5     4       4         0.625
# 7      4      9     5     2     3       1         1    
# 8      4      9     6     1     3       1.2       0.5  
# 9      5     10     3     1     1       1         1    
#10      5     10     8     5     1       2.67      5

A few things to note: as @Jon Spring already mentioned, your mode calculation is different from the standard one. If needed, you can change the Mode function above to match your way of calculating it. Also, the loop pairs the i-th group column with the i-th data column, so in practice I hope you would have the same number of group and data columns (here they are unequal).
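If every data/group combination is wanted, as in the question, the same tidy evaluation idea can be extended with a nested loop. This is only a sketch under that assumption: the result name out and the group_data naming scheme are illustrative choices. It also shows why the question's loop fails: group_by(k) looks for a column literally called k, so the string held in k has to be converted with sym() and unquoted with !!, and the new column name has to be built on the left-hand side of :=.

```r
library(dplyr)
library(rlang)

# Example data from the question.
DF <- data.frame(
  group1 = rep(1:5, each = 2),
  group2 = rep(6:10, each = 2),
  data1 = c(1, 1, 1, 1, 1, 4, 5, 6, 3, 8),
  data2 = c(5, 4, 5, 7, 8, 5, 2, 1, 1, 5),
  data3 = c(6, 6, 8, 9, 5, 4, 3, 3, 1, 1)
)

# Mode as defined in this answer: most frequent value, first one wins ties.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Column names are collected once, before any new columns are added.
group_cols <- grep("^group", names(DF), value = TRUE)
data_cols <- grep("^data", names(DF), value = TRUE)

out <- DF
for (g in group_cols) {
  for (d in data_cols) {
    out <- out %>%
      group_by(!!sym(g)) %>%   # string -> symbol -> column reference
      mutate(!!paste0(g, "_", d) := !!sym(d) / Mode(!!sym(d))) %>%
      ungroup()
  }
}
names(out)
```

With two group columns and three data columns this adds six columns, named group1_data1 through group2_data3.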

Answer 3

A base R solution might be just as useful. I used the mode function suggested by @Jon Spring.

mode <- function(codes){
  which.max(tabulate(codes))
}

groups <- c('group1', 'group2')
datas <- c('data1', 'data2', 'data3')

for (grp in groups) {
  for (col in datas) {
    DF[, paste(col, grp, sep = '_')] <- ave(x = DF[[col]], DF[[grp]], FUN = function(x) x / mode(x))
  }
}

   group1 group2 data1 data2 data3 data1_group1 data2_group1 data3_group1 data1_group2 data2_group2 data3_group2
1       1      6     1     5     6     1.000000         1.25        1.000     1.000000         1.25        1.000
2       1      6     1     4     6     1.000000         1.00        1.000     1.000000         1.00        1.000
3       2      7     1     5     8     1.000000         1.00        1.000     1.000000         1.00        1.000
4       2      7     1     7     9     1.000000         1.40        1.125     1.000000         1.40        1.125
5       3      8     1     8     5     1.000000         1.60        1.250     1.000000         1.60        1.250
6       3      8     4     5     4     4.000000         1.00        1.000     4.000000         1.00        1.000
7       4      9     5     2     3     1.000000         2.00        1.000     1.000000         2.00        1.000
8       4      9     6     1     3     1.200000         1.00        1.000     1.200000         1.00        1.000
9       5     10     3     1     1     1.000000         1.00        1.000     1.000000         1.00        1.000
10      5     10     8     5     1     2.666667         5.00        1.000     2.666667         5.00        1.000

3 answers

Answer 1
2 2019-08-29 00:04:44

Answer 2
1 2019-08-29 01:45:50

Answer 3
1 Accepted 2019-08-29 21:43:20
