Dplyr - summarising multiple variables

My original data is structured as such:

Article  Channel1_qty Channel2_qty Channel3_qty

 110        30             10           0
 110        40             0            10
 111        50             5            2
 111        60             3            18

I'm ultimately trying to produce a df that shows the total quantity of clothing articles sold for each channel_qty column, as well as a count of the distinct articles. Using the above example, it would look something like:

Article_count | channel | Sum (total article qty for channel)
      2            1        180
      2            2        18
      2            3        30

I attempted to structure it this way with the following code, but it didn't work:

df %>%
  select(Article,
         channel1_qty, 
         channel2_qty,
         channel3_qty) %>% 
  gather(key = "channel", value = "value", -Article) %>%
  group_by(channel)
  summarise(
    Article_count = n_distinct(Article),
    total = sum(value)
  )

Tried a few variations of this. Thinking of doing it in separate steps or as a loop, if necessary. I'm thinking there must be an easier / more elegant way in dplyr, though. Thanks!

You are on the right track: reshape with tidyr::gather() or tidyr::pivot_longer(), then dplyr::group_by(), and finally dplyr::summarize().
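As a sketch of how the original attempt could be repaired (this is my reading of the code in the question, not something confirmed there): the pipe %>% is missing after group_by(channel), so summarise() is never connected to the pipeline, and select() uses lowercase channel1_qty while the data shown has Channel1_qty. With both fixed, the gather() version might look like this. The data frame df is rebuilt here from the example table.

```r
library(dplyr)
library(tidyr)

# Example data rebuilt from the table in the question
df <- data.frame(
  Article      = c(110L, 110L, 111L, 111L),
  Channel1_qty = c(30L, 40L, 50L, 60L),
  Channel2_qty = c(10L, 0L, 5L, 3L),
  Channel3_qty = c(0L, 10L, 2L, 18L)
)

result <- df %>%
  # Column names must match the data exactly (capital "C")
  select(Article, Channel1_qty, Channel2_qty, Channel3_qty) %>%
  # Wide to long: one row per Article/channel pair
  gather(key = "channel", value = "value", -Article) %>%
  # This pipe was missing in the original attempt
  group_by(channel) %>%
  summarise(
    Article_count = n_distinct(Article),
    total = sum(value)
  )

result
```

Here the channel column keeps the full column names (Channel1_qty and so on). gather() still works but is superseded by pivot_longer() in current tidyr.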

The regex in the names_pattern argument just strips away everything except the actual channel number from the original column names.

library(tidyverse)

d <- structure(list(Article = c(110L, 110L, 111L, 111L), Channel1_qty = c(30L, 40L, 50L, 60L), Channel2_qty = c(10L, 0L, 5L, 3L), Channel3_qty = c(0L, 10L, 2L, 18L)), class = "data.frame", row.names = c(NA, -4L))

d %>% 
  pivot_longer(-Article, 
               names_pattern = "^Channel(.*)_qty", 
               names_to = "channel", 
               values_to = "qty") %>% 
  group_by(channel) %>% 
  summarize(Article_count = n_distinct(Article),
            Sum = sum(qty))
#> # A tibble: 3 × 3
#>   channel Article_count   Sum
#>   <chr>           <int> <int>
#> 1 1                   2   180
#> 2 2                   2    18
#> 3 3                   2    30
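
To see what the names_pattern regex captures, here is a small base-R sketch. pivot_longer() does its own matching internally, so sub() is only a stand-in to illustrate the capture group, not how tidyr implements it.

```r
cols <- c("Channel1_qty", "Channel2_qty", "Channel3_qty")

# Keep only the part matched by the capture group (.*)
channel <- sub("^Channel(.*)_qty", "\\1", cols)
channel
#> [1] "1" "2" "3"
```

The captured value is text, which is why the channel column above is <chr>. If you would rather have integers, pivot_longer() has a names_transform argument in recent tidyr versions, e.g. names_transform = list(channel = as.integer).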

Created on 2022-08-04 by the reprex package (v2.0.1)
