
How to get sum of values every 8 days by date in data frame in R

I don't often have to work with dates in R, but I imagine this is fairly easy. I have daily data, shown below, covering several years, and I want the sum of the values for each 8-day period. What is the best approach?

Any help you can provide will be greatly appreciated!

 str(temp)
'data.frame':648 obs. of  2 variables:
 $ Date : Factor w/ 648 levels "2001-03-24","2001-03-25",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ conv2: num  -3.93 -6.44 -5.48 -6.09 -7.46 ...

head(temp)
Date              amount
24/03/2001  -3.927020472
25/03/2001  -6.4427004
26/03/2001  -5.477592528
27/03/2001  -6.09462162
28/03/2001  -7.45666902
29/03/2001  -6.731540928
30/03/2001  -6.855206184
31/03/2001  -6.807210228
1/04/2001   -5.40278802

I tried to use the aggregate function, but for some reason it doesn't work and aggregates the wrong way:

z <- aggregate(amount ~ Date, timeSequence(from =as.Date("2001-03-24"),to =as.Date("2001-03-29"), by="day"),data=temp,FUN=sum)

I prefer the package xts for such manipulations.

  1. I read your data as a zoo object. Note the flexibility of the format option.

     library(xts)
     ts.dat <- read.zoo(text ='Date amount
     24/03/2001 -3.927020472
     25/03/2001 -6.4427004
     26/03/2001 -5.477592528
     27/03/2001 -6.09462162
     28/03/2001 -7.45666902
     29/03/2001 -6.731540928
     30/03/2001 -6.855206184
     31/03/2001 -6.807210228
     1/04/2001 -5.40278802',header=TRUE,format = '%d/%m/%Y')

  2. Then I extract the end-point indices of each period.

     ep <- endpoints(ts.dat,'days',k=8)

  3. Finally I apply my function to the time series over each period.

     period.apply(x=ts.dat,ep,FUN=sum )
     2001-03-29 2001-04-01
      -36.13014  -19.06520

Use cut() in your aggregate() command.

Some sample data:

set.seed(1)
mydf <- data.frame(
    DATE = seq(as.Date("2000/1/1"), by="day", length.out = 365),
    VALS = runif(365, -5, 5))

Now, the aggregation. See ?cut.Date for details. You can specify the number of days you want in each group using cut:

output <- aggregate(VALS ~ cut(DATE, "8 days"), mydf, sum)
list(head(output), tail(output))
# [[1]]
#   cut(DATE, "8 days")      VALS
# 1          2000-01-01  8.242384
# 2          2000-01-09 -5.879011
# 3          2000-01-17  7.910816
# 4          2000-01-25 -6.592012
# 5          2000-02-02  2.127678
# 6          2000-02-10  6.236126
# 
# [[2]]
#    cut(DATE, "8 days")       VALS
# 41          2000-11-16 17.8199285
# 42          2000-11-24 -0.3772209
# 43          2000-12-02  2.4406024
# 44          2000-12-10 -7.6894484
# 45          2000-12-18  7.5528077
# 46          2000-12-26 -3.5631950

Those are NOT Date classed variables. (No self-respecting program would display a date like that, not to mention the fact that these are labeled as factors.) [I later noticed these were not the same objects.] Furthermore, the timeSequence function (at least the one in the timeDate package) does not return a Date class vector either. So your expectation that there would be a "right way" for two disparate non-Date objects to be aligned in a sensible manner is ill-conceived. The irony is that just using the temp$Date column would have worked, although it only gives one row per date:

> z <- aggregate(amount ~ Date, data=temp , FUN=sum)
> z
        Date    amount
1  1/04/2001 -5.402788
2 24/03/2001 -3.927020
3 25/03/2001 -6.442700
4 26/03/2001 -5.477593
5 27/03/2001 -6.094622
6 28/03/2001 -7.456669
7 29/03/2001 -6.731541
8 30/03/2001 -6.855206
9 31/03/2001 -6.807210

But to get it in 8 day intervals use cut.Date:

> z <- aggregate(temp$amount , 
                 list(Dts = cut(as.Date(temp$Date, format="%d/%m/%Y"), 
                 breaks="8 day")), FUN=sum)
> z
         Dts          x
1 2001-03-24 -49.792561
2 2001-04-01  -5.402788

rollapply. The zoo package has a rolling apply function which can also do non-rolling aggregations. First convert the temp data frame into zoo using read.zoo like this:

library(zoo)
zz <- read.zoo(temp)

and then it's just:

rollapply(zz, 8, sum, by = 8)

Drop the by = 8 if you want a rolling total instead.

(Note that the two versions of temp in your question are not the same. They have different column headings and the Date columns are in different formats. I have assumed the str(temp) output version here. For the head(temp) version one would have to add a format = "%d/%m/%Y" argument to read.zoo.)
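For the head(temp) version of the data, the two rollapply calls could look roughly like the sketch below. It reads the nine sample rows from the question directly as text, and it adds align = "left" (an assumption on top of the answer above) so that each result is labelled with the first date of its window.

```r
library(zoo)

# The nine sample rows from the question, in the head(temp) layout
zz <- read.zoo(text = "Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802", header = TRUE, format = "%d/%m/%Y")

# Non-overlapping blocks of 8 rows; an incomplete trailing block is dropped
blocks <- rollapply(zz, 8, sum, by = 8, align = "left")

# Rolling 8-row totals: one value for every complete window
rolling <- rollapply(zz, 8, sum, align = "left")
```

With only nine rows, just the first block is complete, so blocks holds a single total while rolling holds two.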

aggregate. Here is a solution that does not use any external packages. It uses aggregate based on the original data frame.

ix <- 8 * ((1:nrow(temp) - 1) %/% 8 + 1)
aggregate(temp[2], list(period = temp[ix, 1]), sum)

Note that ix looks like this:

> ix
[1]  8  8  8  8  8  8  8  8 16

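so it groups the indices of the first 8 rows, the second 8 and so on, labelling each group with the date in its last row. If the number of rows is not a multiple of 8 (as in the 9-row sample, where the last index is 16), the final partial group gets an NA label, and aggregate omits rows whose grouping value is NA.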
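If the final partial group should be kept, one possible variant is sketched below. It assumes the head(temp) layout with rows already sorted by date; the names block, first, last and res are made up for the illustration.

```r
# The nine sample rows from the question, with Date converted to class Date
temp <- data.frame(
  Date = as.Date(c("24/03/2001", "25/03/2001", "26/03/2001", "27/03/2001",
                   "28/03/2001", "29/03/2001", "30/03/2001", "31/03/2001",
                   "1/04/2001"), format = "%d/%m/%Y"),
  amount = c(-3.927020472, -6.4427004, -5.477592528, -6.09462162,
             -7.45666902, -6.731540928, -6.855206184, -6.807210228,
             -5.40278802)
)

# Block number per row: 0 for rows 1-8, 1 for rows 9-16, and so on
block <- (seq_len(nrow(temp)) - 1) %/% 8

first <- !duplicated(block)                   # first row of each block
last  <- !duplicated(block, fromLast = TRUE)  # last row of each block

res <- data.frame(
  start = temp$Date[first],
  end   = temp$Date[last],
  total = as.vector(tapply(temp$amount, block, sum))
)
res
# two rows: 24 to 31 March (total about -49.79) and 1 April alone (about -5.40)
```

Because the block number never indexes past the last row, the ninth observation ends up in its own one-day group instead of being dropped.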

A cleaner approach that extends @G. Grothendieck's approach. Note: it does not check whether the dates are continuous or have gaps; the sum is calculated over a fixed number of rows.


code

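  library(zoo) # provides rollapply
  interval = 8 # your desired date interval: 2 days, 3 days or whatever
  enddate = interval-1 # offset from the first to the last row of each interval
  z <- aggregate(.~V1,data = df,sum) # aggregate sum of all duplicate dates
  z$V1 <- as.Date(z$V1)
  nrows = nrow(z) # must be computed after z has been created
  starts = seq(1, nrows, interval) # first row of each interval
  data.frame ( Start.date = z[starts,1],
               End.date =  z[pmin(starts+enddate, nrows),1], # the last interval may be shorter
               Total.sum = rollapply(z$V2, interval, sum, by = interval,
                                     partial = TRUE, align = "left")) # windows start at Start.date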

output

   Start.date   End.date   Total.sum
1  2000-01-01 2000-01-08   9.1395926
2  2000-01-09 2000-01-16  15.0343960
3  2000-01-17 2000-01-24   4.0974712
4  2000-01-25 2000-02-01   4.1102645
5  2000-02-02 2000-02-09 -11.5816277

data

  df <- data.frame(
  V1 = seq(as.Date("2000/1/1"), by="day", length.out = 365),
  V2 = runif(365, -5, 5))
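If the dates may have gaps, grouping by calendar offset rather than by row count is one option. The following is only a sketch in base R, similar in spirit to the cut() answer above; the toy data frame obs and the names bin and res are invented for the example.

```r
# Toy data with missing days (hypothetical values)
obs <- data.frame(
  Date  = as.Date(c("2000-01-01", "2000-01-02", "2000-01-05",
                    "2000-01-09", "2000-01-20")),
  value = c(1, 2, 3, 4, 5)
)

interval <- 8

# Number of whole 8-day periods between the first date and each observation
bin <- as.integer(obs$Date - min(obs$Date)) %/% interval

# First calendar day of the period each observation falls into
obs$Start.date <- min(obs$Date) + bin * interval

res <- aggregate(value ~ Start.date, data = obs, FUN = sum)
res$End.date <- res$Start.date + interval - 1
res
# periods start on 2000-01-01, 2000-01-09 and 2000-01-17, with sums 6, 4 and 5
```

Periods that contain no observations simply do not appear in the result.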
