简体   繁体   English

R编程-将一组由ID索引的时间序列(具有不规则的观察期)分成定期的每月观察

[英]R programming - Split up a group of time series indexed by ID with irregular observation periods into regular monthly observations

I have a set of data regarding amounts of something users with unique IDs used between in a data.frame in r. 我有一组数据,关于在r中的data.frame之间使用唯一ID的用户数量。

ID        start date         end date        amount
1         1-15-2012          2-15-2012       6000
1         2-15-2012          3-25-2012       4000
1         3-25-2012          5-26-2012       3000
1         5-26-2012          6-13-2012       1000
2         1-16-2012          2-27-2012       7000
2         2-27-2012          3-18-2012       2000
2         3-18-2012          5-23-2012       3000
 ....
10000     1-12-2012          2-24-2012       12000
10000     2-24-2012          3-11-2012       22000
10000     3-11-2012          5-27-2012       33000
10000     5-27-2012          6-10-2012       5000    

The time series for each ID starts and ends at inconsistent times, and contain an inconsistent number of observations. 每个ID的时间序列在不一致的时间开始和结束,并且包含不一致的观察次数。 However, they are all formatted in the above manner; 但是,它们都是按照上述方式格式化的。 the start and end dates are Date objects. 开始日期和结束日期是Date对象。

I would like to standardize the breakdowns for each ID to a monthly time series, with data points at the start of each month, weighing the observed amount numbers which happen to straddle two or more months accordingly. 我想将每个ID的细分标准化为每个月的时间序列,并在每个月初使用数据点,权衡观察到的数量数字,这些数字恰好跨越了两个月或更长时间。 In other words, I would like to turn this series into something like 换句话说,我想将此系列变成类似

ID        start date         end date        amount
1         1-1-2012          2-1-2012       3096 = 6000 * 16/31
1         2-1-2012          3-1-2012       4339 = 6000*15/31+4000*14/39
1         3-1-2012          4-1-2012       etc
 ....
1         6-1-2012          7-1-2012       etc
2         1-1-2012          2-1-2012       etc
2         2-1-2012          3-1-2012       etc
2         3-1-2012          4-1-2012       etc
2         4-1-2012          5-1-2012       etc
2         5-1-2012          6-1-2012       etc
 ....
10000     1-1-2012          2-1-2012       etc
 ....
10000     6-1-2012          7-1-2012       etc

Where the value for ID 1 between 2/1/12 and 3/1/12 is calculated by weighing the number of days in the 1-15-2012 to 2-15-2012 observation that land in February (15 days / 31 days) with the amount in that observation span (6000) with the number of days in the 2-15 to 3-25 observation span that fall in February (14 days/ 39 days, as 2012 was a leap year) times the amount in that observation span (4000), yielding 6000*15/31+4000*14/39 = 4339. This should be done for each ID time series. 其中ID 1的值在2/1/12和3/1/12之间的计算方式是权衡2012年1月15日至2012年2月15日观察到的2月降落的天数(15天/ 31天),并将该观察范围(6000)中的值乘以2月份(14天/ 39天,因为2012年是leap年)在2-15到3-25观察范围中的天数乘以该值观察跨度(4000),得出6000 * 15/31 + 4000 * 14/39 =4339。应针对每个ID时间序列执行此操作。 We do not consider the case where the observation periods all fit into one month; 我们不考虑观察期都适合一个月的情况; but if they are spread out over more than two months they should be split up over that number of months with the appropriate weighings. 但是,如果将它们摊开超过两个月,则应使用适当的权重在那几个月内进行分配。

I'm rather new to r and could certainly use some help on this! 我对r相当陌生,可以在此方面使用一些帮助!

Here is using native R: 这是使用本机R:

#The data
df=read.table(text='ID        start_date         end_date        amount
1         1-15-2012          2-15-2012       6000
1         2-15-2012          3-25-2012       4000
1         3-25-2012          5-26-2012       3000
1         5-26-2012          6-13-2012       1000
2         1-16-2012          2-27-2012       7000
2         2-27-2012          3-18-2012       2000
2         3-18-2012          5-23-2012       3000
10000     1-12-2012          2-24-2012       12000
10000     2-24-2012          3-11-2012       22000
10000     3-11-2012          5-27-2012       33000
10000     5-27-2012          6-10-2012       5000',
              header=T,row.names = NULL,stringsAsFactors =FALSE)

df[,2]=as.Date(df[,2],"%m-%d-%Y")
df[,3]=as.Date(df[,3],"%m-%d-%Y")

df1=data.frame(n=1:length(df$ID),ID=df$ID)
df1$startm=as.Date(levels(cut(df[,2],"month"))[cut(df[,2],"month")],"%Y-%m-%d")
df1$endm=as.Date(levels(cut(df[,3],"month"))[cut(df[,3],"month")],"%Y-%m-%d")
df1=df1[,-1]
#compute days in month and total days
df$dayin=as.numeric((df1$endm-1)-df$start_date)
df$daytot=as.numeric(df$end_date-df$start_date)
#separate amount this month and next month
df$ammt=df$amount*df$dayin/df$daytot
df$ammt.1=df$amount*(df$daytot-df$dayin)/df$daytot

#using by compute new amount
df1$amount=do.call(c,
  by(df[,c("ammt","ammt.1")],df$ID,function(d)d[,1]+c(0,d[-nrow(d),2]))
        )
df1

> df1
      ID     startm       endm    amount
1      1 2012-01-01 2012-02-01  3096.774
2      1 2012-02-01 2012-03-01  4339.123
3      1 2012-03-01 2012-05-01  4306.038
4      1 2012-05-01 2012-06-01  1535.842
5      2 2012-01-01 2012-02-01  2500.000
6      2 2012-02-01 2012-03-01  4700.000
7      2 2012-03-01 2012-05-01  3754.545
8  10000 2012-01-01 2012-02-01  5302.326
9  10000 2012-02-01 2012-03-01 13572.674
10 10000 2012-03-01 2012-05-01 36553.571
11 10000 2012-05-01 2012-06-01 13000.000

To solve this I think the easiest way is to break it down into two problems. 为了解决这个问题,我认为最简单的方法是将其分解为两个问题。

  1. How can I get a daily breakdown of the figures I'm interested in? 每天如何获得我感兴趣的数据的细分? This is my assumption based on the information you provided above. 这是我根据您上面提供的信息所做的假设。
  2. How do I group by a date range and summarise to what I'm interested in? 如何按日期范围分组并汇总到我感兴趣的内容?

For the following example, I will use the data set which I created using the code below: 对于以下示例,我将使用通过以下代码创建的数据集:

df <- data.frame(
  id=c(1,1,1,1,2,2,2),
  start_date=as.Date(c("1-15-2012",
                       "2-15-2012",
                       "3-25-2012",
                       "5-26-2012",
                       "1-16-2012",
                       "2-27-2012",
                       "3-18-2012"), "%m-%d-%Y"),
  end_date=as.Date(c("2-15-2012",
                     "3-25-2012",
                     "5-26-2012",
                     "6-13-2012",
                     "2-27-2012",
                     "3-18-2012",
                     "5-23-2012"), "%m-%d-%Y"),
  amount=c(6000,
           4000,
           3000,
           1000,
           7000,
           2000,
           3000)  
  )

1. Provide daily figures 1.提供每日数字

To provide the daily figures, firstly we get the daily contribution: 为了提供每日数据,首先我们得到每日贡献:

df$daily_contribution = df$amount/as.numeric(df$end_date - df$start_date)  

Then, we will expand the date range using the start and end dates. 然后,我们将使用开始日期和结束日期来扩展日期范围。 There are a couple ways which you can do it , but seeing that you apply the dplyr tag, using the dplyr way we have: 您可以通过几种方法来做到这一点 ,但是看到您使用dplyr方法应用了dplyr标签:

library(dplyr)
df <- df %>%
  rowwise() %>%
  do(data.frame(id=.$id, 
                date=as.Date(seq(from=.$start_date, to=(.$end_date), by="day")), 
                daily_contribution=.$daily_contribution))

which has some output which looks like this: 其输出如下所示:

Source: local data frame [285 x 3]
Groups: <by row>

   id       date daily_contribution
1   1 2012-01-15           193.5484
2   1 2012-01-16           193.5484
3   1 2012-01-17           193.5484
4   1 2012-01-18           193.5484
5   1 2012-01-19           193.5484
6   1 2012-01-20           193.5484
7   1 2012-01-21           193.5484
8   1 2012-01-22           193.5484
9   1 2012-01-23           193.5484
10  1 2012-01-24           193.5484
.. ..        ...                ...

2. Create a grouping variable 2.创建一个分组变量

Next we create some kind of grouping variable that we're interested in. I've used lubridate for ease to get the month and year of the dates: 接下来,我们创建某种我们感兴趣的分组变量。为了方便获取日期的月份和年份,我使用了lubridate

library(lubridate)
df$mnth=month(df$date)
df$yr=year(df$date)

Now with all of this we can easily use dplyr to summarise our information by the dates as required. 现在,通过所有这些,我们可以轻松地使用dplyr根据需要的日期汇总我们的信息。

df %>% 
  group_by(id, mnth, yr) %>%
  summarise(amount=sum(daily_contribution))

with output: 输出:

Source: local data frame [11 x 4]
Groups: id, mnth

   id mnth   yr    amount
1   1    1 2012 3290.3226
2   1    2 2012 4441.6873
3   1    3 2012 2902.8122
4   1    4 2012 1451.6129
5   1    5 2012 1591.3978
6   1    6 2012  722.2222
7   2    1 2012 2666.6667
8   2    2 2012 4800.0000
9   2    3 2012 2436.3636
10  2    4 2012 1363.6364
11  2    5 2012 1045.4545

To get it precisely in the format you specified: 要以指定的格式精确获取它:

df %>% rowwise() %>%
  mutate(start_date=as.Date(ISOdate(yr, mnth, 1)),
         end_date=as.Date(ISOdate(yr, mnth+1, 1))) %>%
  select(id, start_date, end_date, amount)

with output: 输出:

Source: local data frame [11 x 4]
Groups: <by row>

   id start_date   end_date    amount
1   1 2012-01-01 2012-02-01 3290.3226
2   1 2012-02-01 2012-03-01 4441.6873
3   1 2012-03-01 2012-04-01 2902.8122
4   1 2012-04-01 2012-05-01 1451.6129
5   1 2012-05-01 2012-06-01 1591.3978
6   1 2012-06-01 2012-07-01  722.2222
7   2 2012-01-01 2012-02-01 2666.6667
8   2 2012-02-01 2012-03-01 4800.0000
9   2 2012-03-01 2012-04-01 2436.3636
10  2 2012-04-01 2012-05-01 1363.6364
11  2 2012-05-01 2012-06-01 1045.4545

as needed. 如所须。

note : I can see from your example, that you have, 3096 = 6000 * 16/31 and 4339 = 6000*15/31+4000*14/39 , but for the first one, as an example, you have 15 of Jan to 31 of Jan which is 17 days if the date range is inclusive. 注意 :从您的示例中可以看到,您有3096 = 6000 * 16/314339 = 6000*15/31+4000*14/39 3096 = 6000 * 16/31 4339 = 6000*15/31+4000*14/39 3096 = 6000 * 16/31 ,但是对于第一个示例,您有1月15日1月31日至1月31日(如果包含日期范围,则为17天)。 You can trivially alter this information if required. 如果需要,您可以更改此信息。

Here's a solution using plyr and reshape . 这是使用plyrreshape的解决方案。 The numbers aren't the same as what you provided, so I may have misunderstood your intent though this seems to meet your stated goal (weighted average of amount by month). 这些数字与您提供的数字不同,因此尽管这似乎达到了您设定的目标(按月加权平均金额),但我可能误解了您的意图。

df$index <- 1:nrow(df) #Create a unique index number

#Format the dates from factors to dates
df$start.date <- as.Date(df$start.date, format="%m/%d/%Y")
df$end.date <- as.Date(df$end.date, format="%m/%d/%Y")

library(plyr); library(reshape)  #Load the libraries

#dlaply = (d)ataframe to (l)ist using (ply)r
#Subset on dataframe by "index" and perform a function on each subset called "X"
#Create a list containing:
#    ID, each day from start to end date, amount recorded over that day
df2 <- dlply(df, .(index), function(X) { 
  ID <- X$ID        #Keep the ID value
  n.days <- as.numeric(difftime( X$end.date, X$start.date ))  #Calculate time difference in days, report the result as a number
  day <- seq(X$start.date, X$end.date, by="days")   #Sequence of days
  amount.per.day <- X$amount/n.days      #Amount for that day
  data.frame(ID, day, amount.per.day)    #Last line is the output
})

#Change list back into data.frame
df3 <- ldply(df2, data.frame)   #ldply = (l)ist to (d)ataframe using (ply)r
df3$mon <-  as.numeric(format(df3$day, "%m"))   #Assign a month to all dates

#Summarize by each ID and month: add up the daily amounts
ddply(df3, .(ID, mon), summarise, amount = sum(amount.per.day))

#       ID mon    amount
#    1   1   1 3290.3226
#    2   1   2 4441.6873
#    3   1   3 2902.8122
#    4   1   4 1451.6129
#    5   1   5 1591.3978
#    6   1   6  722.2222
#    7   2   1 2666.6667
#    8   2   2 4800.0000
#    9   2   3 2436.3636
#    10  2   4 1363.6364
#    11  2   5 1045.4545

Incidentally, for future posts, you can get faster answers if you provide the code to replicate your data. 顺便提一句,对于以后的帖子,如果提供用于复制数据的代码,则可以获得更快的答案。 If your code is somewhat complicated, you can use dput(yourdata) . 如果您的代码有些复杂,则可以使用dput(yourdata) HTH! HTH!

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM