How to calculate running total for prior years in R?

I have a set of variables in my dataset. I want to calculate the running total (and the running mean) across all of these variables, based on all prior years.

To illustrate, this is what my data looks like, including the running_total variable that I want to generate.

country year    X1  X2  X3  X4  X5  running_total

Bahamas 1990    0   0   0   0   1   NA
Bahamas 1991    0   0   1   1   0   1
Bahamas 1992    1   1   0   0   1   3
Bahamas 1993    0   0   0   0   0   6
Bahamas 1994    1   1   0   1   1   6
Bahamas 1995    0   0   1   0   0   10
Bahamas 1996    0   1   0   1   0   11
Bahamas 1997    1   0   1   0   1   13
Bahamas 1998    0   1   0   1   0   16
Bahamas 1999    1   0   1   0   1   18
Bahamas 2000    0   1   0   1   0   21
Bahamas 2001    1   0   1   0   1   23
Bahamas 2002    0   1   0   1   0   26
Bahamas 2003    1   0   0   0   1   28
Bahamas 2004    0   0   0   1   0   30
Bahamas 2005    1   1   0   0   0   31
Bahamas 2006    0   0   1   1   1   33
Bahamas 2007    1   0   0   0   0   36
Bahamas 2008    0   0   1   1   1   37
Bahamas 2009    1   1   0   0   0   40
Bahamas 2010    0   0   1   1   1   42
Bahamas 2011    1   1   0   0   0   45
Bolivia 1990    0   0   0   0   0   NA
Bolivia 1991    0   0   1   1   0   0
Bolivia 1992    0   0   0   0   0   2
Bolivia 1993    0   0   1   0   0   2
Bolivia 1994    0   0   0   0   0   3
Bolivia 1995    0   0   0   0   0   3
Bolivia 1996    0   0   0   0   0   3
Bolivia 1997    0   0   0   0   0   3
Bolivia 1998    0   0   0   0   0   3
Bolivia 1999    0   0   0   0   0   3
Bolivia 2000    0   1   0   1   0   3
Bolivia 2001    0   0   0   0   0   5
Bolivia 2002    0   0   0   0   0   5
Bolivia 2003    0   0   0   0   0   5
Bolivia 2004    0   0   0   0   0   5
Bolivia 2005    0   0   0   0   0   5
Bolivia 2006    0   0   0   0   0   5
Bolivia 2007    0   0   0   0   0   5
Bolivia 2008    0   0   0   0   1   5
Bolivia 2009    0   0   0   0   0   6
Bolivia 2010    0   0   0   0   1   6
Bolivia 2011    0   0   0   0   0   7

The starting year, 1990, is NA. For example, the running total for 1991 is based on 1990, the running total for 1992 is based on 1990-1991, the running total for 1993 is based on 1990-1992, the running total for 1994 is based on 1990-1993, and so on until 2011. Then the same procedure starts over for the next country.
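
As a small illustration of that rule, here is a sketch in base R on a single vector. The values are the row sums (X1 to X5) of the first five Bahamas rows above; the variable names `yearly` and `prior_total` are made up for this example.

```r
# Row sums of X1..X5 for Bahamas, 1990-1994 (taken from the table above)
yearly <- c(1, 2, 3, 0, 4)

# Running total of all *prior* years: the cumulative sum shifted down by one
# position, with NA for the first year because it has no prior years
prior_total <- c(NA, head(cumsum(yearly), -1))
prior_total
# NA 1 3 6 6
```

These values match the running_total column for Bahamas 1990-1994.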

I tried the code below, but it doesn't work the way I want. Surely I need to specify it better, but how?

DF$csum <- ave(DF$X1, DF$X2,DF$X3,DF$X4,DF$X5,FUN=cumsum)

In addition, I would like to generate a running mean based on the same logic.

Any help here would be much appreciated!

structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Bahamas", "Bolivia"), class = "factor"), year = c(1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L), X1 = c(0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X2 = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X3 = c(0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X4 = c(0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X5 = c(1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), running_total = c(NA, 1L, 3L, 6L, 6L, 10L, 11L, 13L, 16L, 18L, 21L, 23L, 26L, 28L, 30L, 31L, 33L, 36L, 37L, 40L, 42L, 45L, NA, 0L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 7L)), .Names = c("country", "year", "X1", "X2", "X3", "X4", "X5", "running_total"), class = "data.frame", row.names = c(NA, -44L))

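library(data.table)
setDT(df)                                      # convert df to a data.table by reference
df[, xt := X1 + X2 + X3 + X4 + X5]             # row total for each country-year
df[, rt2 := shift(cumsum(xt)), by = country]   # cumulative sum lagged by one year, per country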

Actually, it can be solved with a one-liner:

df[, rt3 := {xt = X1 + X2 + X3 + X4 + X5; shift(cumsum(xt))}, by = country]
# Or as Ryan points out:
df[, rt2 := shift(cumsum(Reduce(`+`, .SD))), by = country, .SDcols = grep('^X.*', names(df), value = TRUE)]

All resulting in:

    country year X1 X2 X3 X4 X5 running_total xt rt2
 1: Bahamas 1990  0  0  0  0  1            NA  1  NA
 2: Bahamas 1991  0  0  1  1  0             1  2   1
 3: Bahamas 1992  1  1  0  0  1             3  3   3
 4: Bahamas 1993  0  0  0  0  0             6  0   6
 5: Bahamas 1994  1  1  0  1  1             6  4   6
 6: Bahamas 1995  0  0  1  0  0            10  1  10
 7: Bahamas 1996  0  1  0  1  0            11  2  11
 8: Bahamas 1997  1  0  1  0  1            13  3  13
 9: Bahamas 1998  0  1  0  1  0            16  2  16
10: Bahamas 1999  1  0  1  0  1            18  3  18
11: Bahamas 2000  0  1  0  1  0            21  2  21
12: Bahamas 2001  1  0  1  0  1            23  3  23
13: Bahamas 2002  0  1  0  1  0            26  2  26
14: Bahamas 2003  1  0  0  0  1            28  2  28
15: Bahamas 2004  0  0  0  1  0            30  1  30
16: Bahamas 2005  1  1  0  0  0            31  2  31
17: Bahamas 2006  0  0  1  1  1            33  3  33
18: Bahamas 2007  1  0  0  0  0            36  1  36
19: Bahamas 2008  0  0  1  1  1            37  3  37
20: Bahamas 2009  1  1  0  0  0            40  2  40
21: Bahamas 2010  0  0  1  1  1            42  3  42
22: Bahamas 2011  1  1  0  0  0            45  2  45
23: Bolivia 1990  0  0  0  0  0            NA  0  NA
24: Bolivia 1991  0  0  1  1  0             0  2   0
25: Bolivia 1992  0  0  0  0  0             2  0   2
26: Bolivia 1993  0  0  1  0  0             2  1   2
27: Bolivia 1994  0  0  0  0  0             3  0   3
28: Bolivia 1995  0  0  0  0  0             3  0   3
29: Bolivia 1996  0  0  0  0  0             3  0   3
30: Bolivia 1997  0  0  0  0  0             3  0   3
31: Bolivia 1998  0  0  0  0  0             3  0   3
32: Bolivia 1999  0  0  0  0  0             3  0   3
33: Bolivia 2000  0  1  0  1  0             3  2   3
34: Bolivia 2001  0  0  0  0  0             5  0   5
35: Bolivia 2002  0  0  0  0  0             5  0   5
36: Bolivia 2003  0  0  0  0  0             5  0   5
37: Bolivia 2004  0  0  0  0  0             5  0   5
38: Bolivia 2005  0  0  0  0  0             5  0   5
39: Bolivia 2006  0  0  0  0  0             5  0   5
40: Bolivia 2007  0  0  0  0  0             5  0   5
41: Bolivia 2008  0  0  0  0  1             5  1   5
42: Bolivia 2009  0  0  0  0  0             6  0   6
43: Bolivia 2010  0  0  0  0  1             6  1   6
44: Bolivia 2011  0  0  0  0  0             7  0   7
    country year X1 X2 X3 X4 X5 running_total xt rt2
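
The question also asks for a running mean, which the code above does not cover. The same shift-after-cumulate pattern can be extended as sketched below. This assumes "running mean" means the mean of the yearly row totals over all prior years; the toy table `toy` and the column names `run_total` and `run_mean` are made up for illustration.

```r
library(data.table)

# Toy data (hypothetical values): two countries with three years each
toy <- data.table(
  country = rep(c("A", "B"), each = 3),
  year    = rep(1990:1992, times = 2),
  X1      = c(0, 1, 1, 0, 0, 1),
  X2      = c(1, 1, 0, 0, 1, 0)
)

# Columns whose names start with "X"
x_cols <- grep("^X", names(toy), value = TRUE)

# Row total for each country-year
toy[, xt := rowSums(.SD), .SDcols = x_cols]

# Per country: cumulative sum and cumulative mean, each shifted down one year
# so that a given row only reflects prior years (first year becomes NA)
toy[, `:=`(
  run_total = shift(cumsum(xt)),
  run_mean  = shift(cumsum(xt) / seq_len(.N))
), by = country]

print(toy)
```

Dividing the cumulative sum by `seq_len(.N)` gives the cumulative mean within each group, and `shift()` then moves it down so each year sees only earlier years.
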
df = structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Bahamas", "Bolivia"), class = "factor"), year = c(1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L), X1 = c(0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X2 = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X3 = c(0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X4 = c(0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X5 = c(1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), running_total = c(NA, 1L, 3L, 6L, 6L, 10L, 11L, 13L, 16L, 18L, 21L, 23L, 26L, 28L, 30L, 31L, 33L, 36L, 37L, 40L, 42L, 45L, NA, 0L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 7L)), .Names = c("country", "year", "X1", "X2", "X3", "X4", "X5", "running_total"), class = "data.frame", row.names = c(NA, -44L))

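library(dplyr)

df <- df %>% mutate(sums = X1 + X2 + X3 + X4 + X5) %>%   # row total per country-year
  group_by(country) %>% mutate(sum_shift = lag(sums),    # previous year's total; lag() is dplyr's counterpart to data.table's shift()
                               sum_shift = ifelse(is.na(sum_shift), 0, sum_shift),   # first year has no prior year, so use 0
                               running_total = cumsum(sum_shift))                    # cumulative sum of prior years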

head(df)

   country year X1 X2 X3 X4 X5 running_total sums sum_shift
1: Bahamas 1990  0  0  0  0  1             0    1         0
2: Bahamas 1991  0  0  1  1  0             1    2         1
3: Bahamas 1992  1  1  0  0  1             3    3         2
4: Bahamas 1993  0  0  0  0  0             6    0         3
5: Bahamas 1994  1  1  0  1  1             6    4         0
6: Bahamas 1995  0  0  1  0  0            10    1         4

This is the dplyr solution, but it is basically the same as the data.table solution. We create a column that sums across each row. Then we group by country, shift the row sums down by one year, and take the cumulative sum. We have to set the NAs to 0 for the cumulative sum to work.
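
If the first year of each country should stay NA, as in the desired output in the question, one possible variation is to lag the cumulative sum instead of replacing the NAs. The sketch below uses a made-up toy data frame `toy` and assumes dplyr 1.0.0 or later for `across()`.

```r
library(dplyr)

# Toy data (hypothetical values): two countries with three years each
toy <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(1990:1992, times = 2),
  X1      = c(0, 1, 1, 0, 0, 1),
  X2      = c(1, 1, 0, 0, 1, 0)
)

result <- toy %>%
  # Row total over every column whose name starts with "X"
  mutate(sums = rowSums(across(starts_with("X")))) %>%
  group_by(country) %>%
  # Cumulative sum within each country, lagged by one year so that
  # the first year is NA and later years only count prior years
  mutate(running_total = lag(cumsum(sums))) %>%
  ungroup()

print(result)
```

Because the lag is applied after `cumsum()`, no NA ever enters the cumulative sum, so the replacement step is not needed.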

A solution using dplyr and purrr. We can split the data frame by country, create the running_total column, and then combine the data frames. Notice that this solution does not need to specify individual column names, such as X1 and X2. dat2 is the final output.

library(dplyr)
library(purrr)

dat2 <- dat %>%
  split(.$country) %>%
  map_dfr(~mutate(.x, 
                  running_total = 
                    as.integer(lag(cumsum(rowSums(select(.x, starts_with("X"))))))))

To calculate the running mean, we can follow the same logic by adding the command to the mutate function. Notice that the cummean function is from the dplyr package.

dat2 <- dat %>%
  split(.$country) %>%
  map_dfr(~mutate(.x, 
                  running_total = 
                    as.integer(lag(cumsum(rowSums(select(.x, starts_with("X")))))),
                  running_mean =
                    lag(cummean(rowSums(select(.x, starts_with("X")))))))

DATA

dat <- structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Bahamas", "Bolivia"), class = "factor"), year = c(1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L), X1 = c(0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X2 = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X3 = c(0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X4 = c(0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X5 = c(1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), running_total = c(NA, 1L, 3L, 6L, 6L, 10L, 11L, 13L, 16L, 18L, 21L, 23L, 26L, 28L, 30L, 31L, 33L, 36L, 37L, 40L, 42L, 45L, NA, 0L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 7L)), .Names = c("country", "year", "X1", "X2", "X3", "X4", "X5", "running_total"), class = "data.frame", row.names = c(NA, -44L))

dat$running_total <- NULL
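
For comparison, the same idea can be sketched in base R with `ave()`, which is what the attempt in the question used. In `ave(x, ...)` the arguments after the first are grouping variables, so the original call grouped X1 by the values of X2 to X5 rather than by country. The version below groups the row totals by country instead. The toy data frame `DF` and the helper `lag1` are made up for illustration, and the running mean is assumed to be the mean of the yearly row totals over prior years.

```r
# Toy data (hypothetical values): two countries with three years each
DF <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(1990:1992, times = 2),
  X1      = c(0, 1, 1, 0, 0, 1),
  X2      = c(1, 1, 0, 0, 1, 0)
)

# Row total over every column whose name starts with "X"
x_cols    <- grep("^X", names(DF), value = TRUE)
row_total <- unname(rowSums(DF[x_cols]))

# Hypothetical helper: shift a vector down one position, padding with NA
lag1 <- function(v) c(NA, head(v, -1))

# Group by country; within each group, lag the cumulative sum / cumulative mean
DF$running_total <- ave(row_total, DF$country,
                        FUN = function(v) lag1(cumsum(v)))
DF$running_mean  <- ave(row_total, DF$country,
                        FUN = function(v) lag1(cumsum(v) / seq_along(v)))

print(DF)
```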
