For loop generating months between dates in R

I have a data frame with three columns: employeeid, start date (ydm) and end date (ydm). My objective is to create another data frame with two columns: employee ID and date. The second data frame is built from the first: it takes the IDs from the first data frame, and its date column holds every month between that employee's start date and end date. In simple words, I want to expand the first data frame by month according to each employee's start date and end date.

I managed to write code that does this with a for loop. The problem is that it is very slow, and I have read that loops should be avoided in R. Is there a way to do the same thing much faster?

An example of my data frame and code is below:

    library(lubridate)  # provides ydm()

    # Creating Data frame
    a <- data.frame(employeeid = c('a','b','c'), StartDate = c('2018-1-1','2018-1-5','2018-11-2'),
                    EndDate = c('2018-1-3','2018-1-9','2018-1-8'), stringsAsFactors = F)
    a$StartDate <- ydm(a$StartDate)
    a$EndDate <- ydm(a$EndDate)

    # second empty data frame
    a1 <- a
    a1 <- a1[0,1:2]

    # my code starts
    r <- 1    # current row of a
    r.1 <- 1  # next row to fill in a1
    for (id in a$employeeid) {
      for (i in format(seq(a[r,2], a[r,3], by="month"), "%Y-%m-%d")) {
        a1[r.1,1] <- a[r,1]
        a1[r.1,2] <- i
        r.1 <- r.1 + 1
      }
      r <- r + 1
    }

This gives the following result:


I want the same result, but a bit quicker.

Almost a one-liner with tidyverse:

    library(tidyverse)

    result <- df %>%
        group_by(employeeid) %>%
        summarise(date = list(seq(StartDate,
                                  EndDate,
                                  by = "month"))) %>%
        unnest(date)

    > result
    # A tibble: 12 x 2
       employeeid date      
       <chr>      <date>    
     1 a          2018-01-01
     2 a          2018-02-01
     3 a          2018-03-01
     4 b          2018-05-01
     5 b          2018-06-01
     6 b          2018-07-01
     7 b          2018-08-01
     8 b          2018-09-01
     9 c          2018-11-01
    10 c          2018-12-01
    11 c          2019-01-01
    12 c          2019-02-01

Data

    library(lubridate)  # provides ymd()

    df <- data.frame(employeeid = c('a', 'b', 'c'), 
                     StartDate = ymd(c('2018-1-1', '2018-5-1', '2018-11-1')),
                     EndDate = ymd(c('2018-3-1', '2018-9-1', '2019-02-1')),
                     stringsAsFactors = FALSE)
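
The pipeline in this answer relies on `seq()` applied to `Date` objects with `by = "month"`. As a minimal base-R sketch of that single step, here it is for one employee, using the dates of employee `a` from the data above (the variable names `start`, `end` and `months` are only illustrative):

```r
# Dates of employee "a" from the example data
start <- as.Date("2018-01-01")
end   <- as.Date("2018-03-01")

# One Date per month, stepping forward from start and stopping at or before end
months <- seq(start, end, by = "month")
print(months)
# [1] "2018-01-01" "2018-02-01" "2018-03-01"
```

The grouped pipeline applies the same call once per employee, which is why no explicit loop is needed.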

I'd try to solve this using apply and a custom function that calculates the difference between end and start.

I'm not sure what your desired output looks like, but in the function in the following example all months between start and end are pasted into a single string.


    library(lubridate)  # provides ymd(), month() and year()

    # Creating Data frame
    a <- data.frame(employeeid = c('a','b','c'), StartDate = c('2018-1-1','2018-1-5','2018-11-2'),
                    EndDate = c('2018-2-3','2019-1-9','2020-1-8'), stringsAsFactors = F)
    a$StartDate <- ymd(a$StartDate)
    a$EndDate <- ymd(a$EndDate)

    # abbreviated month names, indexed by numeric month (1 = "Jan", ..., 12 = "Dec")
    month_names = month.abb[1:12]

    # returns the months from start to end as one string; expects a two-element vector
    month_dif = function(dates) {
      start = dates[1] # first element is expected to be the start date
      end = dates[2] # second element is expected to be the end date

      start_month = month(start)
      end_month = month(end)
      start_year = year(start)
      end_year = year(end)
      year_dif = end_year - start_year

      if(year_dif == 0){ # start and end are in the same year: months run from start to end
        return(paste(month_names[start_month:end_month], collapse= ", " ))
      } else { # the range crosses a year boundary: start to December, any full years in between, then January to end
        return(paste(c(month_names[start_month:12],
              rep(month_names, year_dif-1),
              month_names[1:end_month]), collapse = ", "))
      }
    }

    apply(a[2:3], 1, month_dif)

Output:

    > apply(a[2:3], 1, month_dif)
    [1] "Jan, Feb"                                                                 
    [2] "Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan"          
    [3] "Nov, Dec, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan"
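
The same strings can also be built on top of `seq()` instead of handling the year boundary by hand. The following is only a sketch in base R: `month_names_between` is a hypothetical helper name, and it assumes both arguments are `Date` objects. It snaps both ends to the first of their month so that the end month is always included, which mirrors the month-level logic of `month_dif`.

```r
# Hypothetical helper: abbreviated names of every month touched by [start, end].
# Assumes start and end are single Date values with start <= end.
month_names_between <- function(start, end) {
  # Snap both ends to the first day of their month
  first <- as.Date(format(start, "%Y-%m-01"))
  last  <- as.Date(format(end, "%Y-%m-01"))
  months <- seq(first, last, by = "month")
  # month.abb is locale-independent, unlike format(months, "%b")
  paste(month.abb[as.POSIXlt(months)$mon + 1], collapse = ", ")
}

# Same dates as in the example data frame of this answer
starts <- as.Date(c("2018-01-01", "2018-01-05", "2018-11-02"))
ends   <- as.Date(c("2018-02-03", "2019-01-09", "2020-01-08"))

# Index by position so each element keeps its Date class
result <- vapply(seq_along(starts),
                 function(i) month_names_between(starts[i], ends[i]),
                 character(1))
print(result)
```

For these three rows it produces the same strings as the `apply()` call shown above.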

You can use a combination of apply and do.call:

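    # a is assumed to still hold StartDate and EndDate as character strings in
    # year-day-month order (before any ydm() conversion), hence the "%Y-%d-%m" format
    out_apply_list <- apply(X=a, MARGIN=1,
                        FUN=function(x) {
                          data.frame(id= x[1], 
                                     date=seq(from = as.Date(x[2], "%Y-%d-%m"), 
                                              to = as.Date(x[3], "%Y-%d-%m"), 
                                              by = "month"),
                                     row.names = NULL) 
                        })

    # bind the per-row data frames into a single data frame
    df <- do.call(what = rbind, args = out_apply_list)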

This gives you the following output:

    > df
       id       date
    1   a 2018-01-01
    2   a 2018-02-01
    3   a 2018-03-01
    4   b 2018-05-01
    5   b 2018-06-01
    6   b 2018-07-01
    7   b 2018-08-01
    8   b 2018-09-01
    9   c 2018-02-11
    10  c 2018-03-11
    11  c 2018-04-11
    12  c 2018-05-11
    13  c 2018-06-11
    14  c 2018-07-11
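
Because `apply()` hands each row to the function as a character vector, the dates have to be parsed inside it. If the columns have already been converted to `Date`, a variant that indexes rows by position avoids the re-parsing. This is a sketch under that assumption; the data frame below is rebuilt with `as.Date()` using the values the question's data takes after `ydm()`, and the name `pieces` is only illustrative.

```r
# Example data with real Date columns (the question's values after ydm() parsing)
a <- data.frame(
  employeeid = c("a", "b", "c"),
  StartDate  = as.Date(c("2018-01-01", "2018-05-01", "2018-02-11")),
  EndDate    = as.Date(c("2018-03-01", "2018-09-01", "2018-08-01")),
  stringsAsFactors = FALSE
)

# Build one small data frame per row; indexing by row number keeps the Date class
pieces <- lapply(seq_len(nrow(a)), function(i) {
  data.frame(
    id   = a$employeeid[i],
    date = seq(a$StartDate[i], a$EndDate[i], by = "month"),
    stringsAsFactors = FALSE
  )
})

# Bind all pieces in one call instead of growing a data frame row by row
out <- do.call(rbind, pieces)
print(out)
```

Collecting the pieces in a list and binding once is the main difference from the loop in the question, which extends the result one row at a time.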

For the sake of completeness, here is a concise one-liner with data.table:

    library(data.table)

    # a must already have Date columns (after the ydm() conversion)
    setDT(a)[, .(StartDate = seq(StartDate, EndDate, by = "month")), by = employeeid]
        employeeid  StartDate
     1:          a 2018-01-01
     2:          a 2018-02-01
     3:          a 2018-03-01
     4:          b 2018-05-01
     5:          b 2018-06-01
     6:          b 2018-07-01
     7:          b 2018-08-01
     8:          b 2018-09-01
     9:          c 2018-02-11
    10:          c 2018-03-11
    11:          c 2018-04-11
    12:          c 2018-05-11
    13:          c 2018-06-11
    14:          c 2018-07-11

