简体   繁体   English

将季度数据分解为R保持值的每日数据?

[英]Disaggregate quarterly data to daily data in R keeping values?

How does one easily disaggregate quarterly data to daily data? 如何轻松地将季度数据分解为日常数据? In this case I'm using 10 years of US GDP data which have quarterly observations, and I want to expand the data frame to daily level, carrying over the GDP value each day until the next observation. 在这种情况下,我使用了10年的美国GDP数据,这些数据有季度观察结果,我希望将数据框架扩展到每日水平,每天持续GDP值,直到下一次观察。

Reprex table: Reprex表:

structure(list(thedate = structure(c(14426, 14518, 14610, 14700, 
14791, 14883, 14975, 15065, 15156, 15248, 15340, 15431, 15522, 
15614, 15706, 15796, 15887, 15979, 16071, 16161, 16252, 16344, 
16436, 16526, 16617, 16709, 16801, 16892, 16983, 17075, 17167, 
17257, 17348, 17440, 17532, 17622, 17713, 17805, 17897, 17987
), class = "Date"), gdp = c(1.5, 4.5, 1.5, 3.7, 3, 2, -1, 2.9, 
-0.1, 4.7, 3.2, 1.7, 0.5, 0.5, 3.6, 0.5, 3.2, 3.2, -1.1, 5.5, 
5, 2.3, 3.2, 3, 1.3, 0.1, 2, 1.9, 2.2, 2, 2.3, 2.2, 3.2, 3.5, 
2.5, 3.5, 2.9, 1.1, 3.1, 2.1)), class = "data.frame", row.names = c(NA, 
-40L))

We see above: 我们在上面看到:

2009-07-01 | 1.5
2009-10-01 | 4.5

The intended output would look like: 预期的输出看起来像:

2009-07-01 | 1.5
2009-07-02 | 1.5
2009-07-03 | 1.5
etc.
2009-10-01 | 4.5
2009-10-02 | 4.5
2009-10-03 | 4.5

Here is a tidyr and zoo package answer that uses 'last observation carried forward' after inserting a sequence of dates with NA: 这是一个tidyr和动物园包的答案,使用NA插入一系列日期后使用“最后一次观察结转”:

library(tidyverse)
library(zoo)

data %>%
  complete(thedate = seq.Date(min(thedate), max(thedate), by="day")) %>%
  do(na.locf(.))

Edit : Thanks to Shree for reminding that tidyr::fill would eliminate need for zoo: 编辑 :感谢Shree提醒tidyr :: fill将消除对动物园的需求:

library(tidyverse)

data %>%
  complete(thedate = seq.Date(min(thedate), max(thedate), by="day")) %>%
  fill(gdp)
library(lubridate)
d2 = do.call(rbind, lapply(2:NROW(d), function(i){
    data.frame(date = head(seq.Date(d$thedate[i-1], d$thedate[i], "days"), -1),
               gdp = d$gdp[i - 1])
}))
head(d2)
        date gdp
1 2009-07-01 1.5
2 2009-07-02 1.5
3 2009-07-03 1.5
4 2009-07-04 1.5
5 2009-07-05 1.5
6 2009-07-06 1.5
tail(d2)
           date gdp
3556 2019-03-26 3.1
3557 2019-03-27 3.1
3558 2019-03-28 3.1
3559 2019-03-29 3.1
3560 2019-03-30 3.1
3561 2019-03-31 3.1

Here's a base solution: 这是一个基本解决方案:

last_quarter_end_date <- seq.Date(df$thedate[nrow(df)], by = 'quarter', length.out = 2)[-1]-1
seqs <- diff(c(df$thedate, last_quarter_end_date))

data.frame(thedate = rep(df$thedate, seqs) + sequence(seqs)-1
           , gdp = rep(df$gdp, seqs))

Basically, the difference between dates is how many times you need to repeat a GDP column. 基本上,日期之间的差异是您需要重复GDP列的次数。 Also, I can do seq_len() for each difference to add back to the original date. 此外,我可以为每个差异执行seq_len()以添加回原始日期。

Performance This approach is efficient although I'll note that 0.6 ms isn't really much different than 15 ms in the big picture. 性能这种方法是有效的,虽然我会注意到0.6 ms与大图中的15 ms并没有太大差别。

Unit: microseconds
      expr     min       lq      mean  median       uq     max neval
 cole_base   528.1   554.15   690.379   644.9   663.75  3225.7   100
  d_b_base 15735.0 15994.40 17395.754 16243.9 18108.30 38761.8   100
 Ben_tidyr  2808.7  2936.40  3356.324  3076.6  3149.65  8065.1   100

Complete code for reference: 完整的参考代码:

DF <- structure(list(thedate = structure(c(14426, 14518, 14610, 14700, 
                                           14791, 14883, 14975, 15065, 15156, 15248, 15340, 15431, 15522, 
                                           15614, 15706, 15796, 15887, 15979, 16071, 16161, 16252, 16344, 
                                           16436, 16526, 16617, 16709, 16801, 16892, 16983, 17075, 17167, 
                                           17257, 17348, 17440, 17532, 17622, 17713, 17805, 17897, 17987
), class = "Date"), gdp = c(1.5, 4.5, 1.5, 3.7, 3, 2, -1, 2.9, 
                            -0.1, 4.7, 3.2, 1.7, 0.5, 0.5, 3.6, 0.5, 3.2, 3.2, -1.1, 5.5, 
                            5, 2.3, 3.2, 3, 1.3, 0.1, 2, 1.9, 2.2, 2, 2.3, 2.2, 3.2, 3.5, 
                            2.5, 3.5, 2.9, 1.1, 3.1, 2.1)), class = "data.frame", row.names = c(NA, 
                                                                                                -40L))

library(microbenchmark)
library(tidyr)

microbenchmark(cole_base = {
  last_quarter_end_date <- seq.Date(DF$thedate[nrow(DF)], by = 'quarter', length.out = 2)[-1]-1
  seqs <- diff(c(DF$thedate, last_quarter_end_date))

  data.frame(thedate = rep(DF$thedate, seqs) + sequence(seqs)-1
             , gdp = rep(DF$gdp, seqs))
}
, d_b_base = {
  do.call(rbind, lapply(2:NROW(DF), function(i){
    data.frame(date = head(seq.Date(DF$thedate[i-1], DF$thedate[i], "days"), -1),
               gdp = DF$gdp[i - 1])
     }))
}
, Ben_tidyr = {
  DF %>%
    complete(thedate = seq.Date(min(thedate), max(thedate), by="day")) %>%
    fill(gdp)
}
)

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM