
Filling missing dates in R

I would like some help with a data frame transformation required for an analysis. My data consists of a large number of individuals with their complete employment histories. "EX" is a code representing the reason employment ended. Something like this:

id  Date_start    Date_end       EX
13  "2001-02-01"  "2001-05-30"   A
13  "2002-03-01"  "2010-06-02"   B
14  ...           ...
...

What I would like to do is "fill in the gaps". This may not be easy, and it's even more difficult because I want it done per id, and each new row should have the EX value of the row before it, like this:

id  Date_start    Date_end       EX
13  "2001-02-01"  "2001-05-30"   A
13  "2001-05-31"  "2002-02-28"   A
13  "2002-03-01"  "2010-06-02"   B
14  ...           ...
...

I believe the trick would be some kind of lag and aggregate, but I'm totally lost.

This is a little tricky. You can do most of the manipulation with the dplyr package and use the lubridate package to convert the date format (as.Date() works too, but lubridate makes it easier).

library(dplyr)
library(lubridate)

1. Creating the sample data you provided.

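# Column names (named col_names so the base function names() is not shadowed)
col_names <- c("id", "Date_start", "Date_end", "EX")
row1 <- c(13, "2001-02-01", "2001-05-30", "A")
row2 <- c(13, "2002-03-01", "2010-06-02", "B")

# rbind() on character vectors gives a character matrix, so every column,
# including id, starts out as character
testdata <- rbind(row1, row2) %>% data.frame(stringsAsFactors = FALSE)
row.names(testdata) <- NULL

names(testdata) <- col_names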

testdata$Date_start <- testdata$Date_start %>% as_date()
testdata$Date_end <- testdata$Date_end %>% as_date()
testdata
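
As a side note, the same sample data can be built in one call so that id stays numeric and the dates are Date values from the start. This is only a sketch using base R; the name testdata_alt is made up for illustration and is not used in the steps that follow.

```r
# Hypothetical alternative construction of the sample data (base R only).
testdata_alt <- data.frame(
  id         = c(13, 13),
  Date_start = as.Date(c("2001-02-01", "2002-03-01")),
  Date_end   = as.Date(c("2001-05-30", "2010-06-02")),
  EX         = c("A", "B"),
  stringsAsFactors = FALSE
)

# Inspect the column types: id is numeric, the two date columns are Date.
str(testdata_alt)
```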

2. Creating a new data set that contains the rows you want to add.

id: we keep the same id value, since we are grouping by id.
Date_start: if there is a gap, this is the day after the previous row's Date_end; otherwise it is 0, and those rows are filtered out.
Date_end: same logic; if there is a gap, this is the day before the current row's Date_start.
EX: we use the EX value from the row before the gap, as you asked.

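new_data <- testdata %>% 
  group_by(id) %>% 
  # ifelse() returns plain numbers here; 0 marks a row with no gap before it
  mutate(Date_start1 = ifelse(Date_start-lag(Date_end) == 1,0,lag(Date_end)+1),
         Date_end1 = ifelse(Date_start-lag(Date_end) == 1,0,Date_start-1),
         EX=lag(EX)) %>%   # EX of the previous row within the same id
  # filter() also drops each id's first row, where lag() gives NA
  filter(!Date_start1 ==0) %>% 
  select(id, Date_start=Date_start1,Date_end=Date_end1,EX) %>% 
  distinct() %>% 
  ungroup()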

3. Because ifelse() inside mutate() turned the gap dates into numeric values, we use as_date() from lubridate to convert them back to dates.

new_data$Date_start <- as_date(new_data$Date_start)
new_data$Date_end <- as_date(new_data$Date_end)
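
The reason for this conversion step is that base ifelse() does not keep the Date class of its inputs. The small sketch below illustrates that on two made-up dates and shows that dplyr's if_else() keeps the class; the variable names are only for illustration.

```r
library(dplyr)

d       <- as.Date(c("2001-05-31", "2002-03-01"))
has_gap <- c(TRUE, FALSE)

# Base ifelse() drops the Date class and returns days since 1970-01-01.
base_result <- ifelse(has_gap, d, as.Date(NA))

# dplyr::if_else() keeps the Date class of its inputs.
dplyr_result <- if_else(has_gap, d, as.Date(NA))

class(base_result)
class(dplyr_result)
```

If you prefer to avoid the numeric round trip, if_else() is one option, though the answer's approach of converting back with as_date() works as well.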

4. Combine it with your sample data and arrange it by id and Date_start.

final <- rbind(testdata,new_data) %>% data.frame() %>% arrange(id, Date_start)
final

The final result matches the desired output shown in the question: the gap from 2001-05-31 to 2002-02-28 is filled by a new row that carries EX = A.
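
To see how the idea behaves with several ids and more than one gap, here is a minimal sketch on made-up data. It is a variation on the steps above, not the exact code from this answer: it filters on the gap size instead of using ifelse(), so the dates never turn into numbers. The data frame spells and the helper columns prev_end, prev_EX, gap_start and gap_end are hypothetical names.

```r
library(dplyr)

# Hypothetical data: id 13 has two gaps, id 14 has back-to-back spells (no gap).
spells <- data.frame(
  id         = c(13, 13, 13, 14, 14),
  Date_start = as.Date(c("2001-02-01", "2002-03-01", "2011-01-01",
                         "2005-01-01", "2006-01-01")),
  Date_end   = as.Date(c("2001-05-30", "2010-06-02", "2012-12-31",
                         "2005-12-31", "2006-06-30")),
  EX         = c("A", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

gaps <- spells %>%
  group_by(id) %>%
  arrange(Date_start, .by_group = TRUE) %>%
  # Previous spell's end date and EX code within the same id
  mutate(prev_end = lag(Date_end), prev_EX = lag(EX)) %>%
  ungroup() %>%
  # Keep only rows preceded by a gap of more than one day
  filter(!is.na(prev_end), as.numeric(Date_start - prev_end) > 1) %>%
  # The gap runs from the day after the previous end to the day before this start
  transmute(id, gap_start = prev_end + 1, gap_end = Date_start - 1, EX = prev_EX) %>%
  rename(Date_start = gap_start, Date_end = gap_end)

# Original spells plus the gap rows, ordered within each id
filled <- bind_rows(spells, gaps) %>% arrange(id, Date_start)
filled
```

Taking EX from lag(EX) is what gives each gap row the code of the spell just before it; sorting by id as well as Date_start keeps each person's rows together.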
