
Is there an R function which can undo cumsum() and recreate the original non-cumulative column in a dataset?

For simplicity, I have created a small dummy dataset.

Please note: dates are in yyyy-mm-dd format.

Here is the dataset DF:

library(dplyr)

DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
             date = rep(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"), times = 3),
             visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))

# A tibble: 12 x 3
   country date       visits
   <chr>   <chr>       <dbl>
 1 France  2020-01-01     10
 2 France  2020-01-02     16
 3 France  2020-01-03     14
 4 France  2020-01-04     12
 5 England 2020-01-01     11
 6 England 2020-01-02      9
 7 England 2020-01-03     12
 8 England 2020-01-04     14
 9 Spain   2020-01-01     13
10 Spain   2020-01-02     13
11 Spain   2020-01-03     15
12 Spain   2020-01-04     10

Here is the dataset DFc:

DFc <- DF %>% group_by(country) %>% mutate(cumulative_visits = cumsum(visits)) %>%
  select(-visits)  # drop the original column so only the cumulative one remains

# A tibble: 12 x 3
# Groups:   country [3]
   country date       cumulative_visits
   <chr>   <chr>                  <dbl>
 1 France  2020-01-01                10
 2 France  2020-01-02                26
 3 France  2020-01-03                40
 4 France  2020-01-04                52
 5 England 2020-01-01                11
 6 England 2020-01-02                20
 7 England 2020-01-03                32
 8 England 2020-01-04                46
 9 Spain   2020-01-01                13
10 Spain   2020-01-02                26
11 Spain   2020-01-03                41
12 Spain   2020-01-04                51

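Suppose I only have the dataset DFc. Which R functions can I use to recreate the visits column (as shown in dataset DF) and essentially "undo/reverse" cumsum()?

I have been told that I could incorporate the lag() function, but I am not sure how to do this.

Also, how would the code change if the dates were spaced weeks apart rather than one day apart?

Any help would be much appreciated :)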

Starting from your toy example:

library(dplyr)

DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
             date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
             visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))


DF <- DF %>% 
  group_by(country) %>% 
  mutate(cumulative_visits = cumsum(visits)) %>% 
  ungroup()

I suggest two approaches:

  1. diff
  2. lag [as you specifically requested]

DF %>%
  group_by(country) %>%
  mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
         decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>% 
  ungroup()

#> # A tibble: 12 x 6
#>    country date       visits cumulative_visits decum_visits1 decum_visits2
#>    <chr>   <chr>       <dbl>             <dbl>         <dbl>         <dbl>
#>  1 France  2020-01-01     10                10            10            10
#>  2 France  2020-02-01     16                26            16            16
#>  3 France  2020-03-01     14                40            14            14
#>  4 France  2020-04-01     12                52            12            12
#>  5 England 2020-01-01     11                11            11            11
#>  6 England 2020-02-01      9                20             9             9
#>  7 England 2020-03-01     12                32            12            12
#>  8 England 2020-04-01     14                46            14            14
#>  9 Spain   2020-01-01     13                13            13            13
#> 10 Spain   2020-02-01     13                26            13            13
#> 11 Spain   2020-03-01     15                41            15            15
#> 12 Spain   2020-04-01     10                51            10            10
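
The question asks what to do when only the cumulative column is available, and whether the spacing of the dates matters. The following is a minimal sketch under a few assumptions: the tibble below is typed in by hand from the DFc values in the question, its rows are deliberately shuffled, and the name `recovered` is only illustrative. Because lag() works on row position rather than on the date values, the same code should apply whether the dates are days, weeks, or months apart, as long as the rows are sorted by date within each country first.

```r
library(dplyr)

# Only the cumulative column is available, and the rows are out of date order
DFc <- tibble(
  country = rep(c("France", "England", "Spain"), each = 4),
  date = as.Date(rep(c("2020-01-04", "2020-01-02", "2020-01-01", "2020-01-03"),
                     times = 3)),
  cumulative_visits = c(52, 26, 10, 40, 46, 20, 11, 32, 51, 26, 13, 41)
)

recovered <- DFc %>%
  group_by(country) %>%
  # differences are only meaningful when rows are in date order within a group
  arrange(date, .by_group = TRUE) %>%
  # current cumulative value minus the previous one; the first row keeps its value
  mutate(visits = cumulative_visits - lag(cumulative_visits, default = 0)) %>%
  ungroup()

recovered
```

Note that arrange() with `.by_group = TRUE` also sorts by country, so the groups come back in alphabetical order (England, France, Spain) rather than in the original order.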

If a date is missing, say, as in the following example:

DF1 <- DF %>% 
  
  # set to date!
  mutate(date = as.Date(date)) %>%
  
  # remove one date just for the sake of the example
  filter(date != as.Date("2020-02-01"))

Then I suggest you complete() the dates, filling visits with zero, and fill() cumulative_visits with the last value seen. After that you can invert the cumsum in the same way as before.

DF1 %>% 
  group_by(country) %>% 
  
  # complete and fill with zero!
  tidyr::complete(date = seq.Date(min(date), max(date), by = "month"), fill = list(visits = 0)) %>% 
  
  # fill cumulative with the last available value
  tidyr::fill(cumulative_visits) %>%
  
  # reset in the same way
  mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
         decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>% 
  ungroup()
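
For dates spaced weeks apart, the same pattern seems to carry over by changing the step passed to seq.Date(). The sketch below assumes a small hand-made weekly dataset (the tibble `weekly` and its values are hypothetical, chosen so that France is missing the week of 2020-01-15) and starts from the cumulative column only.

```r
library(dplyr)
library(tidyr)

# Hypothetical weekly data; France has no row for 2020-01-15
weekly <- tibble(
  country = c("France", "France", "France", "Spain", "Spain", "Spain", "Spain"),
  date = as.Date(c("2020-01-01", "2020-01-08", "2020-01-22",
                   "2020-01-01", "2020-01-08", "2020-01-15", "2020-01-22")),
  cumulative_visits = c(10, 26, 52, 13, 26, 41, 51)
)

weekly_visits <- weekly %>%
  group_by(country) %>%
  # add the missing weeks within each country
  complete(date = seq.Date(min(date), max(date), by = "week")) %>%
  # carry the last known cumulative value forward into the added rows
  fill(cumulative_visits) %>%
  # undo the cumulative sum
  mutate(visits = cumulative_visits - lag(cumulative_visits, default = 0)) %>%
  ungroup()

weekly_visits
```

With this approach the added week gets 0 visits, and the following week absorbs everything accumulated across the gap, since the split between the two weeks cannot be recovered from the cumulative totals alone.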

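Here is a general solution. It is sloppy because, as you can see, it does not return foo[1], but that can be fixed (as can the reversed order of the last line's output). I'll leave that "as an exercise for the reader".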

foo <- sample(1:20, 10)
foo
#>  [1] 16 11 13  5  6 12 19 10  3  4
bar <- cumsum(foo)
bar
#>  [1] 16 27 40 45 51 63 82 92 95 99
rev(bar[-1]) - rev(bar[-length(bar)])
#> [1]  4  3 10 19 12  6  5 13 11
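
One possible way to finish that exercise in base R is sketched below. The seed and the names `undone` and `undone2` are arbitrary choices for illustration. The idea is to take successive differences in the original order and put the first cumulative value back at the front.

```r
set.seed(1)                      # arbitrary seed, only to make the draw repeatable
foo <- sample(1:20, 10)
bar <- cumsum(foo)

# First cumulative value, followed by the successive differences
undone <- c(bar[1], diff(bar))

# Equivalent form: prepend a zero and difference the whole vector
undone2 <- diff(c(0, bar))

# Both recover the original vector, in the original order
stopifnot(all(undone == foo), all(undone2 == foo))
```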
