
How to subset data.frame by weeks and then sum?

Let's say I have several years' worth of data that looks like the following:

# load date package and set random seed
library(lubridate)
set.seed(42)

# create data.frame of dates and income
date <- seq(dmy("26-12-2010"), dmy("15-01-2011"), by = "days")
df <- data.frame(date = date, 
                 wday = wday(date),
                 wday.name = wday(date, label = TRUE, abbr = TRUE),
                 income = round(runif(21, 0, 100)),
                 week = format(date, format="%Y-%U"),
                 stringsAsFactors = FALSE)

#          date wday wday.name income    week
# 1  2010-12-26    1       Sun     91 2010-52
# 2  2010-12-27    2       Mon     94 2010-52
# 3  2010-12-28    3      Tues     29 2010-52
# 4  2010-12-29    4       Wed     83 2010-52
# 5  2010-12-30    5     Thurs     64 2010-52
# 6  2010-12-31    6       Fri     52 2010-52
# 7  2011-01-01    7       Sat     74 2011-00
# 8  2011-01-02    1       Sun     13 2011-01
# 9  2011-01-03    2       Mon     66 2011-01
# 10 2011-01-04    3      Tues     71 2011-01
# 11 2011-01-05    4       Wed     46 2011-01
# 12 2011-01-06    5     Thurs     72 2011-01
# 13 2011-01-07    6       Fri     93 2011-01
# 14 2011-01-08    7       Sat     26 2011-01
# 15 2011-01-09    1       Sun     46 2011-02
# 16 2011-01-10    2       Mon     94 2011-02
# 17 2011-01-11    3      Tues     98 2011-02
# 18 2011-01-12    4       Wed     12 2011-02
# 19 2011-01-13    5     Thurs     47 2011-02
# 20 2011-01-14    6       Fri     56 2011-02
# 21 2011-01-15    7       Sat     90 2011-02

I would like to sum 'income' for each week (Sunday through Saturday). Currently I do the following:

Week ending 2011-01-01 = sum(df$income[1:7]) = 487
Week ending 2011-01-08 = sum(df$income[8:14]) = 387
Week ending 2011-01-15 = sum(df$income[15:21]) = 443

However, I would like a more robust approach that sums by week automatically. I can't work out how to automatically subset the data into weeks. Any help would be much appreciated.

First use format() to convert your dates to week numbers, then plyr::ddply() to calculate the summaries:

library(plyr)
df$week <- format(df$date, format="%Y-%U")
ddply(df, .(week), summarize, income=sum(income))
# Output as posted (apparently computed on the question's data before it was modified):
#      week income
# 1 2011-52    413
# 2 2012-01    435
# 3 2012-02    379

For more information on format.Date, see ?strptime, in particular the part that defines %U as the week number.
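
To illustrate how %U behaves at a year boundary, here is a small base-R sketch. The variable names d and wk are made up for illustration; the three dates and their labels are the ones shown in the question's table.

```r
# Three consecutive days around the 2010/2011 year boundary
d <- as.Date(c("2010-12-31", "2011-01-01", "2011-01-02"))

# %U is the week of the year (00-53), where the first Sunday starts week 01
wk <- format(d, "%Y-%U")
print(wk)
# [1] "2010-52" "2011-00" "2011-01"
```

Saturday 2011-01-01 lands in week 00 of the new year, so a Sunday-to-Saturday week that spans New Year is split across two labels when grouping on this string.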


EDIT:

Given the modified data and requirement, one way is to divide the date by 7 to get a number indicating the week. (Or, more precisely, divide by the number of seconds in a week to get the number of weeks since the epoch, which is 1970-01-01 by default.)

In code: 在代码中:

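# seconds since the epoch / seconds per week, truncated = whole weeks since 1970-01-01
# as.POSIXct() keeps the value in seconds even if df$date is a Date (which counts days)
df$week <- as.Date("1970-01-01")+7*trunc(as.numeric(as.POSIXct(df$date))/(3600*24*7))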
library(plyr)
ddply(df, .(week), summarize, income=sum(income))

        week income
1 2010-12-23    298
2 2010-12-30    392
3 2011-01-06    294
4 2011-01-13    152

I have not checked that the week boundaries are on Sunday. You will have to check this, and insert an appropriate offset into the formula.
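
As a sketch of what such an offset could look like: 1970-01-01 was a Thursday, so weeks counted from the epoch start on Thursdays, and Sundays fall on day numbers that are 3 modulo 7. The example below is base R only and rebuilds the question's data by hand; the names offset, days, week_start and weekly are illustrative assumptions, not part of the original answer.

```r
# Rebuild the question's data (income values copied from the question's table)
date <- seq(as.Date("2010-12-26"), as.Date("2011-01-15"), by = "day")
income <- c(91, 94, 29, 83, 64, 52, 74, 13, 66, 71, 46, 72, 93, 26,
            46, 94, 98, 12, 47, 56, 90)
df <- data.frame(date = date, income = income)

# Days since 1970-01-01 (a Thursday); shifting by 3 days moves the
# week boundary from Thursday to Sunday
offset <- 3
days <- as.numeric(df$date)
df$week_start <- as.Date(7 * ((days - offset) %/% 7) + offset,
                         origin = "1970-01-01")

# Sum income within each Sunday-to-Saturday week
weekly <- aggregate(income ~ week_start, data = df, FUN = sum)
print(weekly)
#   week_start income
# 1 2010-12-26    487
# 2 2011-01-02    387
# 3 2011-01-09    443
```

The sums match the manual 487 / 387 / 443 from the question, labelled here by the Sunday that starts each week.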

This is now simple using dplyr. Also, I would suggest using cut(breaks = "week") rather than format() to cut the dates into weeks.

library(dplyr)
df %>% group_by(week = cut(date, "week")) %>% mutate(weekly_income = sum(income))
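
One point worth checking with this approach: cut() on Date objects starts weeks on Monday by default, while the question asks for Sunday through Saturday. A minimal base-R sketch using the start.on.monday argument of cut() for dates follows; the data is rebuilt by hand from the question, and the names week and weekly are illustrative.

```r
# Rebuild the question's data (income values copied from the question's table)
date <- seq(as.Date("2010-12-26"), as.Date("2011-01-15"), by = "day")
income <- c(91, 94, 29, 83, 64, 52, 74, 13, 66, 71, 46, 72, 93, 26,
            46, 94, 98, 12, 47, 56, 90)

# start.on.monday = FALSE makes each bin run Sunday through Saturday;
# the factor levels are the first day of each week
week <- cut(date, breaks = "week", start.on.monday = FALSE)

# One total per week
weekly <- tapply(income, week, sum)
print(weekly)
# 2010-12-26 2011-01-02 2011-01-09
#        487        387        443
```

The same cut() call could be dropped into group_by() in the dplyr pipeline above, with summarise() in place of mutate() if one row per week is wanted.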

I Googled "group week days into weeks R" and came across this SO question. You mention you have multiple years, so I think we need to keep track of both the week number and the year, so I modified the answers there like so: format(date, format = "%y%U")

In use it looks like this:

library(plyr) #for aggregating
df <- transform(df, weeknum = format(date, format = "%y%U"))
ddply(df, "weeknum", summarize, suminc = sum(income))
#----
  weeknum suminc
1    1152    413
2    1201    435
3    1202    379

See ?strptime for all the format abbreviations.

Try rollapply from the zoo package:

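library(zoo)
# non-overlapping windows of 7 rows; assumes the data starts on a Sunday with no missing days
rollapply(df$income, width=7, FUN = sum, by = 7)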
# [1] 487 387 443

Or, use period.sum from the xts package:

library(xts)
# use the row positions of Saturdays (wday == 7) as the period endpoints
period.sum(xts(df$income, order.by=df$date), which(df$wday %in% 7))
#            [,1]
# 2011-01-01  487
# 2011-01-08  387
# 2011-01-15  443

Or, to get the output in the format you want:

data.frame(income = period.sum(xts(df$income, order.by=df$date), 
                               which(df$wday %in% 7)),
           week = df$week[which(df$wday %in% 7)])
#            income    week
# 2011-01-01    487 2011-00
# 2011-01-08    387 2011-01
# 2011-01-15    443 2011-02

Note that the first week shows as 2011-00 because that's how it is entered in your data. You could also use week = df$week[which(df$wday %in% 1)], which would match your output.

This solution is influenced by @Andrie and @Chase.

# load plyr 
library(plyr)

# format weeks as per requirement (replace "00" with "52" and adjust corresponding year)
tmp <- list()
tmp$y <- format(df$date, format="%Y")
tmp$w <- format(df$date, format="%U")
tmp$y[tmp$w=="00"] <- as.character(as.numeric(tmp$y[tmp$w=="00"]) - 1)
tmp$w[tmp$w=="00"] <- "52"
df$week <- paste(tmp$y, tmp$w, sep = "-")

# get summary
df2 <- ddply(df, .(week), summarize, income=sum(income))

# include week ending date
tmp$week.ending <- lapply(df2$week, function(x) rev(df[df$week==x, "date"])[[1]])
df2$week.ending <- sapply(tmp$week.ending, as.character)

#      week income week.ending
# 1 2010-52    487  2011-01-01
# 2 2011-01    387  2011-01-08
# 3 2011-02    443  2011-01-15

df.index = df['week']  # set the dt variable as the index (pandas; must be a datetime column)

df.resample('W').sum()  # sum using resample

With dplyr:

library(dplyr)

df %>% 
  arrange(date) %>%
  mutate(week = as.numeric(date - date[1])%/%7) %>%
  group_by(week) %>%
  summarise(weekincome= sum(income))

Instead of date[1] you can use any date from which you want to start your weekly study.
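
For readers without dplyr, the same idea can be sketched in base R. The start date below is a hypothetical choice (the first Sunday in the question's data), and the names start, week and weekly are illustrative.

```r
# Rebuild the question's data (income values copied from the question's table)
date <- seq(as.Date("2010-12-26"), as.Date("2011-01-15"), by = "day")
income <- c(91, 94, 29, 83, 64, 52, 74, 13, 66, 71, 46, 72, 93, 26,
            46, 94, 98, 12, 47, 56, 90)

# Hypothetical reference date: weeks are counted in 7-day blocks from here
start <- as.Date("2010-12-26")

# Subtracting two Dates gives a difference in days; integer-divide by 7
week <- as.numeric(date - start) %/% 7

# One total per week index (0, 1, 2, ...)
weekly <- tapply(income, week, sum)
print(weekly)
#   0   1   2
# 487 387 443
```

Because the blocks are counted from start, picking a Sunday gives Sunday-to-Saturday weeks, and the index keeps increasing across year boundaries instead of resetting.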
