R: How to sum and summarise a table based on multiple criteria

Question

Here is my original data frame:

df <- read.table(text="
  Date         Index  Event
  2014-03-31   A      x
  2014-03-31   A      x
  2014-03-31   A      y
  2014-04-01   A      y
  2014-04-01   A      x
  2014-04-01   B      x
  2014-04-02   B      x
  2014-04-03   A      x
  2014-09-30   B      x", header = TRUE, stringsAsFactors = FALSE)

date_range <- seq(as.Date(min(df$Date)), as.Date(max(df$Date)), 'days')
indices <- unique(df$Index)
events_table <- unique(df$Event)

I want to summarise my data frame so that the output has exactly one record for each index in indices and each date in date_range, with a new column for each event in events_table giving the cumulative count of that event over all dates prior to the value in the Date column. The data does not always contain a record for every index or every date.

Here is my desired output:

Date        Index  cumsum(Event = x) cumsum(Event = y)
2014-03-31  A      0                 0
2014-03-31  B      0                 0
2014-04-01  A      2                 1
2014-04-01  B      0                 0
2014-04-02  A      3                 2
2014-04-02  B      1                 0
...  
2014-09-29  A      4                 2
2014-09-29  B      2                 0
2014-09-30  A      4                 2
2014-09-30  B      2                 0

FYI: this is a simplified version of the data frame. There are ~200,000 records per year, with hundreds of different Index values for each Date.

I did this once before (before my hard drive fried) using by() and maybe aggregate(), but the process was very slow and I can't get it worked out this time around. I've also tried ddply(), but I can't get the cumsum() function to work with it. Using ddply(), I tried something like:

ddply(xo1, .(Date,Index), summarise, 
      sum.x = sum(Event == 'x'), 
      sum.y = sum(Event == 'y'))

to no avail.
Through searching, I found "Replicating an Excel SUMIFS formula", which gets me the cumulative part of my project, but I couldn't figure out how to summarise the result down to only one record per date/index combination. I also came across "sum/aggregate data based on dates, R", but there I couldn't work out the dynamic date aspect.

Thanks to anyone who can help!

Answer 1

library(dplyr)
library(tidyr)

df$Date <- as.Date(df$Date)

Step 1: Generate a full list of {Date, Index} pairs

full_dat <- expand.grid(
  Date = date_range, 
  Index = indices,
  stringsAsFactors = FALSE
  ) %>% 
  arrange(Date, Index) %>%
  as_tibble()  # tbl_df() is deprecated in current dplyr
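
As a quick sanity check on Step 1, here is a minimal base R sketch on a shortened, made-up three-day range (the dates and the name grid are illustrative assumptions, not part of the answer). It shows that expand.grid() yields one row per {Date, Index} combination, including combinations that never occur in the data:

```r
# Hypothetical three-day range, only to keep the example small
date_range <- seq(as.Date("2014-03-31"), as.Date("2014-04-02"), by = "day")
indices <- c("A", "B")

# One row for every combination of date and index
grid <- expand.grid(
  Date = date_range,
  Index = indices,
  stringsAsFactors = FALSE
)

nrow(grid)           # length(date_range) * length(indices)
anyDuplicated(grid)  # 0 means every {Date, Index} pair appears exactly once
```

With the full data, the same call should give length(date_range) * length(indices) rows, which is the shape of the desired output.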

Step 2: Define a cumsum() function that treats NA as 0

cumsum2 <- function(x){

  x[is.na(x)] <- 0
  cumsum(x)

}
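
To see why the helper is needed, the sketch below compares it with plain cumsum() on a made-up vector (counts is a hypothetical name). The NA entries stand in for {Date, Index} pairs that have no records after the join:

```r
# Same helper as in Step 2: replace NA with 0, then accumulate
cumsum2 <- function(x) {
  x[is.na(x)] <- 0
  cumsum(x)
}

# Made-up daily counts; NA marks days with no matching records
counts <- c(2, NA, 1, NA)

cumsum(counts)   # NA propagates: every value after the first NA is NA
cumsum2(counts)  # NA counted as zero: 2 2 3 3
```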

Step 3: Generate totals per {Date, Index}, join them with the full {Date, Index} data, and compute the lagged cumulative sum.

df %>%
  group_by(Date, Index) %>%
  summarise(
    totx = sum(Event == "x"),
    toty = sum(Event == "y")
    ) %>%
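  right_join(full_dat, by = c("Date", "Index")) %>% 
  # sort so the cumulative sum runs in date order within each index
  # (the row order after right_join() depends on the dplyr version)
  arrange(Date, Index) %>%
  group_by(Index) %>%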
  mutate(
    cumx = lag(cumsum2(totx)),
    cumy = lag(cumsum2(toty))
    ) %>%
  # some clean up.
  select(-starts_with("tot")) %>%
  mutate(
    cumx = ifelse(is.na(cumx), 0, cumx),
    cumy = ifelse(is.na(cumy), 0, cumy)
    )
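
The lag in Step 3 is what turns an inclusive running total into a count of events strictly before each date. A minimal base R sketch of that idea, using the x counts for Index A on the first four dates of the sample data (the names daily, inclusive and prior are illustrative only):

```r
# x events for Index A on 2014-03-31, 04-01, 04-02 and 04-03
daily <- c(2, 1, 0, 1)

# Running total that includes the current date
inclusive <- cumsum(daily)           # 2 3 3 4

# Shift by one position and start at 0: events strictly before each date
prior <- c(0, head(inclusive, -1))   # 0 2 3 3

# Equivalent formulation: subtract the current day's count
all(prior == inclusive - daily)
```

The values in prior match the cumsum(Event = x) column for Index A in the desired output.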

Answer 2

Would something like this, using dplyr and tidyr, work?

library(dplyr)
library(tidyr)

df %>%
  group_by(Date, Index, Event) %>%
  summarise(events = n()) %>%
  group_by(Index, Event) %>%
  mutate(cumsum_events = cumsum(events)) %>%
  select(-events) %>%
  spread(Event, cumsum_events) %>%
  rename(sum.x = x,
         sum.y = y)

#        Date Index sum.x sum.y
#1 2014-03-31     A     2     1
#2 2014-04-01     A     3     2
#3 2014-04-01     B     1    NA
#4 2014-04-02     B     2    NA
#5 2014-04-03     A     4    NA
#6 2014-09-30     B     3    NA
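
Note that the output above only contains dates that occur in the data, leaves NA where an event never occurred, and includes the current date in each total. One possible way to get closer to the desired output is sketched below. It is not part of the original answer: it assumes tidyr 1.0 or later (for pivot_wider()), and the names res, n and cum are illustrative.

```r
library(dplyr)
library(tidyr)

df <- read.table(text = "
  Date         Index  Event
  2014-03-31   A      x
  2014-03-31   A      x
  2014-03-31   A      y
  2014-04-01   A      y
  2014-04-01   A      x
  2014-04-01   B      x
  2014-04-02   B      x
  2014-04-03   A      x
  2014-09-30   B      x", header = TRUE, stringsAsFactors = FALSE)

res <- df %>%
  mutate(Date = as.Date(Date)) %>%
  # events per {Date, Index, Event}
  count(Date, Index, Event) %>%
  # add every missing day / index / event combination with a count of 0
  complete(
    Date = seq(min(Date), max(Date), by = "day"),
    Index, Event,
    fill = list(n = 0L)
  ) %>%
  arrange(Date, Index, Event) %>%
  group_by(Index, Event) %>%
  # running total minus today's count = events strictly before this date
  mutate(cum = cumsum(n) - n) %>%
  ungroup() %>%
  select(-n) %>%
  # one column per event: sum.x, sum.y
  pivot_wider(names_from = Event, values_from = cum, names_prefix = "sum.")
```

Here res should have one row per date and index, with zeros instead of NA.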

Answer 1: score 3, accepted, posted 2015-01-28 17:38:12
Answer 2: score 1, posted 2015-01-28 17:09:48