
R Cumulative Sum with a condition and a reset

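I have a signal position indicator vector consisting of -1 and 1. In addition, I have volume data that needs to be accumulated according to the value of Signal. The basic data table looks like this: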

df <- cbind(Signal, Volume)
head(df, 20)

           Signal    Volume
2016-01-04     NA  37912403
2016-01-05     -1  23258238
2016-01-06     -1  25096183
2016-01-07     -1  45172906
2016-01-08     -1  35402298
2016-01-11     -1  29932385
2016-01-12     -1  28395390
2016-01-13     -1  33410553
2016-01-14     -1  48658623
2016-01-15      1  46132781
2016-01-19      1  30998256
2016-01-20     -1  59051429
2016-01-21      1  30518939
2016-01-22      1  30495387
2016-01-25      1  32482015
2016-01-26     -1  26877080
2016-01-27     -1  58699359
2016-01-28      1 107475327
2016-01-29      1  62739548
2016-02-01      1  46132726

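What I want to achieve (without using a for loop) is a vector of cumulative volume that resets every time Signal changes. In addition, the volume should be multiplied by the value of Signal, i.e. when Signal is -1, -Volume should be added to the current cumulative volume. Based on a similar question on SO, I tried: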

ave(df$a, cumsum(c(F, diff(sign(diff(df$a))) != 0)*df$Volume), FUN=seq_along) 

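It produces the correct grouping of the signal, but for some reason does not include the volume. Without the reset, the solution is fairly simple (posted on SO):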

require(data.table)
DT <- data.table(dt)
DT[, Cum.Sum := cumsum(Volume), by=Signal]
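
To see why grouping by Signal alone does not reset, here is a minimal base R sketch on six rows taken from the table above (2016-01-13 to 2016-01-21). The names `pooled`, `run_id` and `reset` are illustrative only, and the sketch assumes Signal contains no NA, so the leading NA row is left out.

```r
# Six rows from the question's data (2016-01-13 to 2016-01-21)
Signal <- c(-1, -1, 1, 1, -1, 1)
Volume <- c(33410553, 48658623, 46132781, 30998256, 59051429, 30518939)

# Grouping by Signal alone: all -1 rows share one group and all 1 rows
# share another, so the sum keeps growing across separate runs
pooled <- ave(Volume, Signal, FUN = cumsum)

# A new run starts wherever Signal differs from the previous value
run_id <- cumsum(c(TRUE, diff(Signal) != 0))

# Cumulative sum of the signed volume, restarted within each run
reset <- ave(Signal * Volume, run_id, FUN = cumsum)

print(data.frame(Signal, Volume, pooled, run_id, reset))
```

In `pooled`, the fifth row (a lone -1) continues the total of the first two rows, while in `reset` it starts over at -59051429.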

Does anyone know a dplyr or data.table type of solution for a cumulative sum with a reset and a sign adjustment? Thanks.

This can be achieved with:

library(tidyverse)
library(data.table)     

z %>%
  group_by(grp = rleid(Signal)) %>% # run id: advances every time Signal changes
  mutate(cum = Signal * cumsum(Volume)) %>% # cumulative sum within each run, signed by Signal
  ungroup() %>% # ungroup so the grouping column can be removed
  select(-grp) # remove the grouping column
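
As a small illustration of what the grouping key looks like, assuming data.table is installed: `rleid()` gives each run of identical consecutive values its own id. The vector `sig` is made up for this example.

```r
library(data.table)

# Hypothetical signal vector with a leading NA, like the first row of the data
sig <- c(NA, -1, -1, 1, 1, -1)

# One id per run of consecutive identical values
ids <- rleid(sig)
print(ids)
# 1 2 2 3 3 4
```

The leading NA forms a run of its own, and since `Signal * cumsum(Volume)` is NA there, the first row of the result is NA.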

Or without data.table, by using rle:

z %>%
  mutate(rl = rep(seq_along(rle(Signal)$lengths), times = rle(Signal)$lengths)) %>% # run id built from run lengths
  group_by(rl) %>%
  mutate(cum = Signal * cumsum(Volume)) %>%
  ungroup() %>%
  select(-rl) # remove the grouping column

#output
   date       Signal    Volume        cum
   <fct>       <int>     <int>      <int>
 1 2016-01-04     NA  37912403         NA
 2 2016-01-05     -1  23258238  -23258238
 3 2016-01-06     -1  25096183  -48354421
 4 2016-01-07     -1  45172906  -93527327
 5 2016-01-08     -1  35402298 -128929625
 6 2016-01-11     -1  29932385 -158862010
 7 2016-01-12     -1  28395390 -187257400
 8 2016-01-13     -1  33410553 -220667953
 9 2016-01-14     -1  48658623 -269326576
10 2016-01-15      1  46132781   46132781
11 2016-01-19      1  30998256   77131037
12 2016-01-20     -1  59051429  -59051429
13 2016-01-21      1  30518939   30518939
14 2016-01-22      1  30495387   61014326
15 2016-01-25      1  32482015   93496341
16 2016-01-26     -1  26877080  -26877080
17 2016-01-27     -1  58699359  -85576439
18 2016-01-28      1 107475327  107475327
19 2016-01-29      1  62739548  170214875
20 2016-02-01      1  46132726  216347601
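
The run ids built from `rle()` can be checked on a short vector. This is a minimal base R sketch; `sig`, `runs` and `run_id` are names made up for the example.

```r
# Hypothetical signal vector with three runs
sig <- c(-1, -1, 1, 1, -1)

# rle() returns the length and value of each run
runs <- rle(sig)
print(runs$lengths)
# 2 2 1

# Repeat each run's index as many times as the run is long
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)
print(run_id)
# 1 1 2 2 3

# rle() treats a missing value as unequal to the previous value,
# so consecutive NAs each form a run of length 1
na_runs <- rle(c(NA, NA, 1))$lengths
print(na_runs)
# 1 1 1
```

With a single leading NA, as in the data here, this makes no difference to the result.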

Data:

z <- read.table(text =      "date     Signal    Volume
           2016-01-04     NA  37912403
           2016-01-05     -1  23258238
           2016-01-06     -1  25096183
           2016-01-07     -1  45172906
           2016-01-08     -1  35402298
           2016-01-11     -1  29932385
           2016-01-12     -1  28395390
           2016-01-13     -1  33410553
           2016-01-14     -1  48658623
           2016-01-15      1  46132781
           2016-01-19      1  30998256
           2016-01-20     -1  59051429
           2016-01-21      1  30518939
           2016-01-22      1  30495387
           2016-01-25      1  32482015
           2016-01-26     -1  26877080
           2016-01-27     -1  58699359
           2016-01-28      1 107475327
           2016-01-29      1  62739548
           2016-02-01      1  46132726", header = TRUE)

A pure dplyr way is:

df %>% 
  na.omit() %>% # drop NA rows so nothing is multiplied by NA
  mutate(isStep = (Signal - lag(Signal, 1)) != 0) %>% # flag rows where Signal changes
  mutate(isStep = ifelse(is.na(isStep), FALSE, isStep)) %>% # first row has no predecessor: not a step
  mutate(grp = cumsum(isStep)) %>% # run id: increases by one at every step
  group_by(grp) %>%  # group by the run ids created above
  mutate(res = cumsum(Signal * Volume)) %>% # signed cumulative sum within each run
  select(x, Signal, Volume, res)

# # A tibble: 19 x 5
# # Groups:   grp [6]
#      grp          x Signal    Volume        res
#    <int>     <fctr>  <int>     <int>      <int>
#  1     0 2016-01-05     -1  23258238  -23258238
#  2     0 2016-01-06     -1  25096183  -48354421
#  3     0 2016-01-07     -1  45172906  -93527327
#  4     0 2016-01-08     -1  35402298 -128929625
#  5     0 2016-01-11     -1  29932385 -158862010
#  6     0 2016-01-12     -1  28395390 -187257400
#  7     0 2016-01-13     -1  33410553 -220667953
#  8     0 2016-01-14     -1  48658623 -269326576
#  9     1 2016-01-15      1  46132781   46132781
# 10     1 2016-01-19      1  30998256   77131037
# 11     2 2016-01-20     -1  59051429  -59051429
# 12     3 2016-01-21      1  30518939   30518939
# 13     3 2016-01-22      1  30495387   61014326
# 14     3 2016-01-25      1  32482015   93496341
# 15     4 2016-01-26     -1  26877080  -26877080
# 16     4 2016-01-27     -1  58699359  -85576439
# 17     5 2016-01-28      1 107475327  107475327
# 18     5 2016-01-29      1  62739548  170214875
# 19     5 2016-02-01      1  46132726  216347601
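
A more compact variant of the same idea, as a sketch only: the step flag and the group id can be computed in one `mutate()` by giving `lag()` a default, so that the first row compares equal to itself. The data frame `d` holds six rows from the question (2016-01-13 to 2016-01-21) and the sketch assumes Signal has no NA.

```r
library(dplyr)

# Six rows from the question's data (2016-01-13 to 2016-01-21)
d <- data.frame(
  Signal = c(-1, -1, 1, 1, -1, 1),
  Volume = c(33410553, 48658623, 46132781, 30998256, 59051429, 30518939)
)

res <- d %>%
  # TRUE where Signal differs from the previous row; cumsum() turns that into a run id
  mutate(grp = cumsum(Signal != lag(Signal, default = first(Signal)))) %>%
  group_by(grp) %>%
  # signed cumulative sum, restarted in each run
  mutate(res = cumsum(Signal * Volume)) %>%
  ungroup()

print(res)
```

The `grp` column starts at 0, as in the output above, because the first row is never counted as a step.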

As suggested by @docendo, this should work:

df[,cum := cumsum(Volume)*Signal,.(rleid(Signal))]

          date Signal    Volume        cum
 1: 2016-01-04     NA  37912403         NA
 2: 2016-01-05     -1  23258238  -23258238
 3: 2016-01-06     -1  25096183  -48354421
 4: 2016-01-07     -1  45172906  -93527327
 5: 2016-01-08     -1  35402298 -128929625
 6: 2016-01-11     -1  29932385 -158862010
 7: 2016-01-12     -1  28395390 -187257400
 8: 2016-01-13     -1  33410553 -220667953
 9: 2016-01-14     -1  48658623 -269326576
10: 2016-01-15      1  46132781   46132781
11: 2016-01-19      1  30998256   77131037
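
Note that `:=` only works on a data.table, so a plain data frame needs converting first. Here is a minimal self-contained sketch, assuming data.table is installed; `d` holds the first three rows of the data.

```r
library(data.table)

# First three rows of the question's data, including the leading NA
d <- data.frame(
  date   = c("2016-01-04", "2016-01-05", "2016-01-06"),
  Signal = c(NA, -1, -1),
  Volume = c(37912403, 23258238, 25096183)
)

# Convert to a data.table by reference so that := can be used
setDT(d)

# Cumulative volume within each run of Signal, signed by Signal
d[, cum := cumsum(Volume) * Signal, by = rleid(Signal)]

print(d)
```

Because Signal is constant within each run, `cumsum(Volume) * Signal` and `cumsum(Signal * Volume)` give the same values.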
