简体   繁体   English

如何插入过去的值来代替缺失的月份值?

[英]How to insert past values in place of missing month values?

I have a dataframe with 2 columns (Date and Values).我有一个包含 2 列(日期和值)的数据框。 I would like the frequency of observations to be monthly.我希望观察的频率是每月一次。 However, I don't have observations for all the months.但是,我没有观察所有月份。 I want to add as a missing observation the value at time t-1.我想将时间 t-1 的值添加为缺失的观察值。

Here an example to make it clearer:这是一个更清楚的例子:

df <- data.frame(Date = c("2000-01-05", "2000-02-03", "2000-03-02", "2000-04-13", "2000-05-11", "2000-06-08", "2000-07-06", "2000-09-14", "2000-10-05", "2000-11-02", "2000-12-14", "2001-02-01", "2001-03-01", "2001-04-11", "2001-05-10", "2001-06-07", "2001-07-05", "2001-08-30", "2001-10-11", "2001-11-08", "2001-12-06", "2002-01-03", "2002-02-07", "2002-03-07", "2002-04-04", "2002-05-02", "2002-06-06", "2002-07-04", "2002-09-12", "2002-10-10"), Fit = c( 1.00000000, -1.00000000, -0.81612680, -0.42769496,  1.00000000, -1.50947974, -0.02154276, -1.47427092, -1.46782501, -1.17309887, -0.70347628, -1.93465483, -3.00667550, -1.55652236, -4.10292471, -1.10159442, -2.64296439, -2.03574462, -1.55986632, -1.73125990, -1.34045640, -2.01864867, -2.51081773, -3.07896217, -3.02724723, -0.76456774, -1.81459657, -2.13093106, -1.91543051, -1.31418467))

   Date         Fit
1  2000-01-05  1.00000000
2  2000-02-03 -1.00000000
3  2000-03-02 -0.81612680
4  2000-04-13 -0.42769496
5  2000-05-11  1.00000000
6  2000-06-08 -1.50947974
7  2000-07-06 -0.02154276
8  2000-09-14 -1.47427092
9  2000-10-05 -1.46782501
10 2000-11-02 -1.17309887
11 2000-12-14 -0.70347628
12 2001-02-01 -1.93465483
13 2001-03-01 -3.00667550
14 2001-04-11 -1.55652236
15 2001-05-10 -4.10292471
16 2001-06-07 -1.10159442
17 2001-07-05 -2.64296439
18 2001-08-30 -2.03574462
19 2001-10-11 -1.55986632
20 2001-11-08 -1.73125990
21 2001-12-06 -1.34045640
22 2002-01-03 -2.01864867
23 2002-02-07 -2.51081773
24 2002-03-07 -3.07896217
25 2002-04-04 -3.02724723
26 2002-05-02 -0.76456774
27 2002-06-06 -1.81459657
28 2002-07-04 -2.13093106
29 2002-09-12 -1.91543051
30 2002-10-10 -1.31418467

# by running:

lapply(split(df, format(as.Date(df$Date), "%Y")), function(x) month.name[setdiff(seq(12), as.numeric(format(as.Date(x$Date), "%m")))])

# you will be able to see the missing month to get monthly frequency

I want to get this:我想得到这个:

   Date         Fit
1  2000-01-05  1.00000000
2  2000-02-03 -1.00000000
3  2000-03-02 -0.81612680
4  2000-04-13 -0.42769496
5  2000-05-11  1.00000000
6  2000-06-08 -1.50947974
7  2000-07-06 -0.02154276
8  2000-08-06 -0.02154276
8  2000-09-14 -1.47427092
9  2000-10-05 -1.46782501
10 2000-11-02 -1.17309887
11 2000-12-14 -0.70347628
11 2000-01-15 -0.70347628
12 2001-02-01 -1.93465483
13 2001-03-01 -3.00667550
14 2001-04-11 -1.55652236
15 2001-05-10 -4.10292471
16 2001-06-07 -1.10159442
17 2001-07-05 -2.64296439
18 2001-08-30 -2.03574462
19 2001-09-30 -2.03574462
19 2001-10-11 -1.55986632
20 2001-11-08 -1.73125990
21 2001-12-06 -1.34045640
22 2002-01-03 -2.01864867
23 2002-02-07 -2.51081773
24 2002-03-07 -3.07896217
25 2002-04-04 -3.02724723
26 2002-05-02 -0.76456774
27 2002-06-06 -1.81459657
28 2002-07-04 -2.13093106
28 2002-08-04 -2.13093106
29 2002-09-12 -1.91543051
30 2002-10-10 -1.31418467

As you can see, every missing month has been replaced with the value of the previous month.如您所见,每个缺失的月份都被替换为上个月的值。

Can anyone help me do it?有人可以帮我做吗?

Thanks so much!非常感谢!

df <- data.frame(Date = c("2000-01-05", "2000-02-03", "2000-03-02", "2000-04-13", "2000-05-11", "2000-06-08", "2000-07-06", "2000-09-14", "2000-10-05", "2000-11-02", "2000-12-14", "2001-02-01", "2001-03-01", "2001-04-11", "2001-05-10", "2001-06-07", "2001-07-05", "2001-08-30", "2001-10-11", "2001-11-08", "2001-12-06", "2002-01-03", "2002-02-07", "2002-03-07", "2002-04-04", "2002-05-02", "2002-06-06", "2002-07-04", "2002-09-12", "2002-10-10"), Fit = c( 1.00000000, -1.00000000, -0.81612680, -0.42769496,  1.00000000, -1.50947974, -0.02154276, -1.47427092, -1.46782501, -1.17309887, -0.70347628, -1.93465483, -3.00667550, -1.55652236, -4.10292471, -1.10159442, -2.64296439, -2.03574462, -1.55986632, -1.73125990, -1.34045640, -2.01864867, -2.51081773, -3.07896217, -3.02724723, -0.76456774, -1.81459657, -2.13093106, -1.91543051, -1.31418467))

Suggested solution using dplyr and tidyr :使用dplyrtidyr建议解决方案:

library(dplyr)
library(tidyr)
df <- df %>% 
  mutate(Date = as.Date(as.character(Date, "%Y-%m-%d")),
       Month = format(Date, "%m"),
       Year = format(Date, "%Y"))  %>% 
  complete(Month = formatC(1:12, 1, flag=0), nesting(Year)) %>% 
  mutate(Date = if_else(is.na(Date), as.Date(paste(Year, Month, "1", sep="-"), "%Y-%m-%d"), Date))%>% 
  arrange(Date) %>% 
  select(Date, Fit) %>% 
  mutate(Fit = if_else(is.na(Fit), lag(Fit), Fit)) %>% 
  mutate(Fit = if_else(is.na(Fit), lag(Fit), Fit)) # Do this twice as a 'hackish' solution for the last value

Returns:返回:

Date         Fit
1  2000-01-05  1.00000000
2  2000-02-03 -1.00000000
3  2000-03-02 -0.81612680
4  2000-04-13 -0.42769496
5  2000-05-11  1.00000000
6  2000-06-08 -1.50947974
7  2000-07-06 -0.02154276
8  2000-08-01 -0.02154276
9  2000-09-14 -1.47427092
10 2000-10-05 -1.46782501
11 2000-11-02 -1.17309887
12 2000-12-14 -0.70347628
13 2001-01-01 -0.70347628
14 2001-02-01 -1.93465483
15 2001-03-01 -3.00667550
16 2001-04-11 -1.55652236
17 2001-05-10 -4.10292471
18 2001-06-07 -1.10159442
19 2001-07-05 -2.64296439
20 2001-08-30 -2.03574462
21 2001-09-01 -2.03574462
22 2001-10-11 -1.55986632
23 2001-11-08 -1.73125990
24 2001-12-06 -1.34045640
25 2002-01-03 -2.01864867
26 2002-02-07 -2.51081773
27 2002-03-07 -3.07896217
28 2002-04-04 -3.02724723
29 2002-05-02 -0.76456774
30 2002-06-06 -1.81459657
31 2002-07-04 -2.13093106
32 2002-08-01 -2.13093106
33 2002-09-12 -1.91543051
34 2002-10-10 -1.31418467
35 2002-11-01 -1.31418467
36 2002-12-01 -1.31418467

I first formatted your data.frame, then created a helper df and joined this by a character string id.我首先格式化了你的 data.frame,然后创建了一个 helper df 并通过一个字符串 id 加入了它。

library(dplyr)
# Format to date and create an ID
df <- df %>%
  mutate(Date = as.Date(Date, format = "%Y-%m-%d")) %>% 
  mutate(id = substr(as.character(Date),1,7 ))

# Create a sequence from you min and max dates in your original df.
# Also, add an ID column for the join.
df_helper <- 
  data.frame(Date= seq(min(as.Date(df$Date)), max(as.Date(df$Date)), by = "month")) %>%
  mutate(id = substr(as.character(Date),1,7 ))

# Perform the join and fill
new_df <- df_helper %>% left_join(df, by ="id") %>% 
  select(Date.x, id, Fit) %>% 
  rename( Date = Date.x) %>% 
  fill(Fit)

Here's a solution in base R style using the lubridate package:这是使用lubridate包的基本 R 风格的lubridate方案:

library(lubridate)

df$Date       <- as.POSIXct(as.character(df$Date))
added         <- df[which(month(df$Date[-nrow(df)]) != month(df$Date[-1]) - 1), ]
added$Date    <- added$Date + months(1)
df            <- rbind(df, added)
df            <- df[order(df$Date),]
row.names(df) <- seq(nrow(df))

result:结果:

df
#>          Date         Fit
#> 1  2000-01-05  1.00000000
#> 2  2000-02-03 -1.00000000
#> 3  2000-03-02 -0.81612680
#> 4  2000-04-13 -0.42769496
#> 5  2000-05-11  1.00000000
#> 6  2000-06-08 -1.50947974
#> 7  2000-07-06 -0.02154276
#> 8  2000-08-06 -0.02154276
#> 9  2000-09-14 -1.47427092
#> 10 2000-10-05 -1.46782501
#> 11 2000-11-02 -1.17309887
#> 12 2000-12-14 -0.70347628
#> 13 2001-01-14 -0.70347628
#> 14 2001-02-01 -1.93465483
#> 15 2001-03-01 -3.00667550
#> 16 2001-04-11 -1.55652236
#> 17 2001-05-10 -4.10292471
#> 18 2001-06-07 -1.10159442
#> 19 2001-07-05 -2.64296439
#> 20 2001-08-30 -2.03574462
#> 21 2001-09-30 -2.03574462
#> 22 2001-10-11 -1.55986632
#> 23 2001-11-08 -1.73125990
#> 24 2001-12-06 -1.34045640
#> 25 2002-01-03 -2.01864867
#> 26 2002-01-06 -1.34045640
#> 27 2002-02-07 -2.51081773
#> 28 2002-03-07 -3.07896217
#> 29 2002-04-04 -3.02724723
#> 30 2002-05-02 -0.76456774
#> 31 2002-06-06 -1.81459657
#> 32 2002-07-04 -2.13093106
#> 33 2002-08-04 -2.13093106
#> 34 2002-09-12 -1.91543051
#> 35 2002-10-10 -1.31418467

Created on 2020-02-24 by the reprex package (v0.3.0)reprex 包(v0.3.0) 于 2020 年 2 月 24 日创建

Another dplyr + tidyr + lubridate solution:另一个dplyr + tidyr + lubridate解决方案:

library(dplyr)
library(tidyr)
library(lubridate)

df %>%
  mutate(Date = as.Date(Date),
         month = floor_date(Date, "month")) %>%
  right_join(tibble(month = seq.Date(min(.$month), max(.$month), by = "month"))) %>%
  mutate(day = day(Date)) %>%
  fill(Fit, day) %>%
  mutate(Date = month %m+% days(day-1)) %>%
  select(Date, Fit)

Result:结果:

         Date         Fit
1  2000-01-05  1.00000000
2  2000-02-03 -1.00000000
3  2000-03-02 -0.81612680
4  2000-04-13 -0.42769496
5  2000-05-11  1.00000000
6  2000-06-08 -1.50947974
7  2000-07-06 -0.02154276
8  2000-08-06 -0.02154276
9  2000-09-14 -1.47427092
10 2000-10-05 -1.46782501
11 2000-11-02 -1.17309887
12 2000-12-14 -0.70347628
13 2001-01-14 -0.70347628
14 2001-02-01 -1.93465483
15 2001-03-01 -3.00667550
16 2001-04-11 -1.55652236
17 2001-05-10 -4.10292471
18 2001-06-07 -1.10159442
19 2001-07-05 -2.64296439
20 2001-08-30 -2.03574462
21 2001-09-30 -2.03574462
22 2001-10-11 -1.55986632
23 2001-11-08 -1.73125990
24 2001-12-06 -1.34045640
25 2002-01-03 -2.01864867
26 2002-02-07 -2.51081773
27 2002-03-07 -3.07896217
28 2002-04-04 -3.02724723
29 2002-05-02 -0.76456774
30 2002-06-06 -1.81459657
31 2002-07-04 -2.13093106
32 2002-08-04 -2.13093106
33 2002-09-12 -1.91543051
34 2002-10-10 -1.31418467

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM