Is there an R function that undoes cumsum() and recreates the original non-cumulative column in a dataset?

Question

For simplicity, I have created a small dummy dataset.

Please note: dates are in yyyy-mm-dd format.

Here is the dataset DF:

library(dplyr)  # provides tibble(), %>%, group_by(), mutate()

DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
             date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
             visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))

# A tibble: 12 x 3
   country date       visits
   <chr>   <chr>       <dbl>
 1 France  2020-01-01     10
 2 France  2020-02-01     16
 3 France  2020-03-01     14
 4 France  2020-04-01     12
 5 England 2020-01-01     11
 6 England 2020-02-01      9
 7 England 2020-03-01     12
 8 England 2020-04-01     14
 9 Spain   2020-01-01     13
10 Spain   2020-02-01     13
11 Spain   2020-03-01     15
12 Spain   2020-04-01     10

Here is the dataset DFc:

DFc <- DF %>% group_by(country) %>% mutate(cumulative_visits = cumsum(visits)) %>% select(-visits)

# A tibble: 12 x 3
# Groups:   country [3]
   country date       cumulative_visits
   <chr>   <chr>                  <dbl>
 1 France  2020-01-01                10
 2 France  2020-02-01                26
 3 France  2020-03-01                40
 4 France  2020-04-01                52
 5 England 2020-01-01                11
 6 England 2020-02-01                20
 7 England 2020-03-01                32
 8 England 2020-04-01                46
 9 Spain   2020-01-01                13
10 Spain   2020-02-01                26
11 Spain   2020-03-01                41
12 Spain   2020-04-01                51

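Suppose I only have the dataset DFc. Which R functions can I use to recreate the visits column (as shown in DF), essentially "undoing" or reversing cumsum()?

I have been told that I could incorporate the lag() function, but I am not sure how to do that.

Also, how would the code change if the dates were spaced weeks apart rather than one day apart?

Any help would be much appreciated :)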

Answer 1

Starting from your toy example:

library(dplyr)

DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
             date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
             visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))


DF <- DF %>% 
  group_by(country) %>% 
  mutate(cumulative_visits = cumsum(visits)) %>% 
  ungroup()

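I suggest two approaches:

diff() (column decum_visits1 in the code that follows)
lag() (column decum_visits2), as you specifically asked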

DF %>%
  group_by(country) %>%
  mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
         decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>% 
  ungroup()

#> # A tibble: 12 x 6
#>    country date       visits cumulative_visits decum_visits1 decum_visits2
#>    <chr>   <chr>       <dbl>             <dbl>         <dbl>         <dbl>
#>  1 France  2020-01-01     10                10            10            10
#>  2 France  2020-02-01     16                26            16            16
#>  3 France  2020-03-01     14                40            14            14
#>  4 France  2020-04-01     12                52            12            12
#>  5 England 2020-01-01     11                11            11            11
#>  6 England 2020-02-01      9                20             9             9
#>  7 England 2020-03-01     12                32            12            12
#>  8 England 2020-04-01     14                46            14            14
#>  9 Spain   2020-01-01     13                13            13            13
#> 10 Spain   2020-02-01     13                26            13            13
#> 11 Spain   2020-03-01     15                41            15            15
#> 12 Spain   2020-04-01     10                51            10            10
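
For readers without dplyr, the same per-group de-cumulation can be sketched in base R. This is only an illustration, not part of the original answer: it rebuilds DFc by hand as a plain data.frame and assumes the rows are already sorted by date within each country.

```r
# Hand-built copy of DFc (cumulative column only), rows sorted by date within country
DFc <- data.frame(
  country = rep(c("France", "England", "Spain"), each = 4),
  date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
  cumulative_visits = c(10, 26, 40, 52, 11, 20, 32, 46, 13, 26, 41, 51)
)

# ave() applies FUN within each country and returns a vector in the original row order.
# The first value of each group is kept as-is; the rest are successive differences.
DFc$visits <- ave(DFc$cumulative_visits, DFc$country,
                  FUN = function(x) c(x[1], diff(x)))

DFc$visits
```

If the rows might not be in date order, sorting first (for example with order(DFc$country, DFc$date)) would be needed, since both diff() and lag() work on row position, not on the date values.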

If a date is missing, as in the following example:

DF1 <- DF %>% 
  
  # set to date!
  mutate(date = as.Date(date)) %>%
  
  # remove one date just for the sake of the example
  filter(date != as.Date("2020-02-01"))

In that case I suggest you complete() the dates, filling visits with zero, and fill() cumulative_visits with the last observed value. You can then reverse the cumsum exactly as before.

DF1 %>% 
  group_by(country) %>% 
  
  # complete and fill with zero!
  tidyr::complete(date = seq.Date(min(date), max(date), by = "month"), fill = list(visits = 0)) %>% 
  
  # fill cumulative with the last available value
  tidyr::fill(cumulative_visits) %>%
  
  # reset in the same way
  mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
         decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>% 
  ungroup()
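
To make the effect of the gap-filling concrete, here is a small base R sketch for a single country (France with the 2020-02-01 row removed). The object names obs, grid and full are hypothetical, and the forward-fill line assumes the first row of the series is observed. For weekly data, by = "week" would presumably be the analogous choice in seq().

```r
# Hypothetical single-country series with the 2020-02-01 row missing
obs <- data.frame(
  date = as.Date(c("2020-01-01", "2020-03-01", "2020-04-01")),
  cumulative_visits = c(10, 40, 52)
)

# Full regular grid of dates (monthly here; by = "week" for weekly spacing)
grid <- data.frame(date = seq(min(obs$date), max(obs$date), by = "month"))

# Left join onto the grid: the missing date gets NA in cumulative_visits
full <- merge(grid, obs, by = "date", all.x = TRUE)

# Carry the last observed cumulative value forward (assumes row 1 is not NA)
idx <- cummax(seq_along(full$cumulative_visits) * !is.na(full$cumulative_visits))
full$cumulative_visits <- full$cumulative_visits[idx]

# De-cumulate as before
full$decum_visits <- c(full$cumulative_visits[1], diff(full$cumulative_visits))

full$decum_visits
#> [1] 10  0 30 12
```

Note that the visits of the missing date cannot be recovered: the filled-in row gets 0, and the next observed date carries the combined total of both periods (16 + 14 = 30).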

Answer 2

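Here is a general solution. It is sloppy because, as you can see, it does not return foo[1], but that can be fixed. (So can the reversed order of the output of the last line.) I will leave that "as an exercise for the reader".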

foo <- sample(1:20, 10)
foo
#>  [1] 16 11 13  5  6 12 19 10  3  4
bar <- cumsum(foo)
bar
#>  [1] 16 27 40 45 51 63 82 92 95 99
rev(bar[-1]) - rev(bar[-length(bar)])
#> [1]  4  3 10 19 12  6  5 13 11
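
One possible way to finish that exercise, sketched with the same numbers hard-coded instead of sampled so the result is reproducible: prepend the first cumulative value and take successive differences, which returns every element of foo in its original order. The name undone is just an illustrative choice.

```r
# Same values as the sampled foo, fixed for reproducibility
foo <- c(16, 11, 13, 5, 6, 12, 19, 10, 3, 4)
bar <- cumsum(foo)

# The first cumulative value is the first original value;
# every later original value is the difference between neighbours
undone <- c(bar[1], diff(bar))

identical(undone, foo)
#> [1] TRUE
```

An equivalent form is diff(c(0, bar)), which treats the running total as starting from zero.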

Answer 1 (0, accepted, 2020-10-19 14:50:55)

Answer 2 (0, 2020-10-19 18:06:05)