Converting Month, Day and no Year into a Time Series

I have a medium-sized data set. Here is a sample from it:

2011.2012
9/7 
11/5
12/15
1/5
2/5

I'd like to convert this data into a time series format.

After converting the values from factors to characters, I used the as.Date function, but I ran into a problem.

The results assume that the missing year is the current year. My goal is to assign the year 2011 to the dates before 1/1 and the year 2012 to the dates from 1/1 onward. The data range from September 2011 to April 2012.
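A minimal sketch of the behaviour described above, assuming only base R: when the format string has no year, the strptime-based parser behind as.Date fills in the current year, so the year in the result depends on when the code is run.

```r
# Parse a month/day string that carries no year.
d <- as.Date("9/7", format = "%m/%d")

# The month and day come from the input ...
format(d, "%m-%d")                             # "09-07"

# ... while the year is taken from the system clock.
format(d, "%Y") == format(Sys.Date(), "%Y")    # TRUE
```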

I've tried using origin and start, but to no avail. Here is my code:

date1 <- as.character(2011.2012)
date1 <- as.Date(date1, format="%m/%d") 

How about splitting at the June/July boundary, so that months after June get 2011 and the rest get 2012? The details depend on the format of your dates.

x <- c("9/7", "11/5", "12/15", "1/5", "2/5")
sapply(strsplit(x, '/'),
       function(x) paste(if (as.numeric(x[1]) > 6) 2011 else 2012,
                         x[1], x[2], sep="/"))

[1] "2011/9/7"   "2011/11/5"  "2011/12/15" "2012/1/5"   "2012/2/5"  

Here is a vectorized approach to the above, which uses ifelse rather than if:

mm <- matrix(nrow=2, unlist(strsplit(x, '/')))
paste(ifelse(as.numeric(mm[1,]) > 6, 2011, 2012), mm[1,], mm[2,], sep='/')

[1] "2011/9/7"   "2011/11/5"  "2011/12/15" "2012/1/5"   "2012/2/5"  

The vectorized approach is less readable, but it runs about 1.7x faster.
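The comparison can be reproduced with a sketch along these lines. The wrapper names by_element and vectorized are hypothetical, introduced only for this illustration, and the timings will vary by machine and R version, so no particular ratio is claimed here.

```r
x <- c("9/7", "11/5", "12/15", "1/5", "2/5")

# Element-wise version: one anonymous function call per date.
by_element <- function(x) {
  sapply(strsplit(x, "/"), function(p)
    paste(if (as.numeric(p[1]) > 6) 2011 else 2012, p[1], p[2], sep = "/"))
}

# Vectorized version: a single ifelse over all months at once.
vectorized <- function(x) {
  mm <- matrix(unlist(strsplit(x, "/")), nrow = 2)  # row 1 = months, row 2 = days
  paste(ifelse(as.numeric(mm[1, ]) > 6, 2011, 2012), mm[1, ], mm[2, ], sep = "/")
}

# Both versions should return the same character vector.
identical(by_element(x), vectorized(x))

# Rough timing on a longer input built by repeating the sample.
big <- rep(x, 2000)
system.time(by_element(big))
system.time(vectorized(big))
```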

Here is one way to vectorize @MarkMiller's approach, using the Date functions and looking for a decrease in the month from one date to the next:

initialYear <- 2011

dd <- as.Date(x, "%m/%d")
mon <- format(dd, "%m")
as.Date(paste(initialYear + c(0, cumsum(diff(as.numeric(mon))<0))
                , mon
                , format(dd, "%d")
                , sep="-"
                )
          )

[1] "2011-09-07" "2011-11-05" "2011-12-15" "2012-01-05" "2012-02-05"

Probably because of all the Date functions, this takes 3.6x longer to run than the vectorized ifelse approach above (2.6x if the final as.Date is removed). Unlike that approach, however, it is not limited to 2011 and 2012. I did not time Mark's code, but this version is probably faster than its three sapply calls and explicit for loop.
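To make the month-decrease idea concrete, here are the intermediate values for the sample data, written out step by step. The variable names step, rolled and offset are introduced only for this illustration.

```r
mon <- c(9, 11, 12, 1, 2)          # month numbers in data order

step   <- diff(mon)                # 2  1 -11  1
rolled <- step < 0                 # FALSE FALSE TRUE FALSE
offset <- c(0, cumsum(rolled))     # 0 0 0 1 1  (years elapsed since the first date)

2011 + offset                      # 2011 2011 2011 2012 2012
```

Each negative step marks a year boundary, and the cumulative sum counts how many boundaries have been crossed so far, which is why the same expression keeps working when the data span more than two years.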

Here is what I came up with. I do not know that this code will always work, but it seems to work with the example data set I used. The code seems to handle >2 years and any day of the year.

The code cannot handle a year for which there are no data, but if the year is not recorded in the data set, then such a gap probably could not be identified by any method.

Note also that this approach will fail with the following two dates: "1/30" and "3/1", if the two dates are from two consecutive years. That is because there is such a long gap between the two dates that there is no way for the computer to realize the two dates do not come from the same year.

In other words, if there are very long gaps between two consecutive dates any approach is likely to fail without additional information. If there is, for example, at least one date from every quarter or half year then I think both posted answers will work because the computer will be able to identify a decrease in consecutive months as indicating a new year.

Maybe both approaches will work if the longest gap between two consecutive dates is 11 months. Maybe a gap of 363 days would be okay if the code were modified to also compare the day of the month for each pair of consecutive dates.

# specify the initial year and create dates from the data

initial.year = 2010

date  <- c("12/30", "1/1", "6/1", "6/1", "10/25", "11/27", "12/28", 
           "1/16", "2/17", "2/17", "2/17")

DDD3  <- as.Date(date, format="%m/%d")

# deconstruct dates into month, day and erroneous year

dtstr <- as.character(DDD3)
month <- as.numeric(as.character(sapply(strsplit(dtstr, "-") , "[", 2)))
day   <- as.numeric(as.character(sapply(strsplit(dtstr, "-") , "[", 3)))
year  <- as.numeric(as.character(sapply(strsplit(dtstr, "-") , "[", 1)))
DDD4  <- data.frame(month, day, year)

# obtain correct year for each date

year2=rep(NA, nrow(DDD4))
year2[1] = initial.year

for (i in 2:length(year2)) {
    # a drop in the month number marks the start of a new year
    if (DDD4[i, 1] < DDD4[i - 1, 1]) {
        year2[i] <- year2[i - 1] + 1
    } else {
        year2[i] <- year2[i - 1]
    }
}

# create new dates using correct year

day2 <- sprintf("%02d", day)
month2 <- sprintf("%02d", month)
year2 <- as.character(year2)

DDD5 <- data.frame(month2, day2, year2)
DDD6 <- paste(DDD5[,1], DDD5[,2], DDD5[,3], sep='/')
DDD7 <- as.Date(DDD6, "%m/%d/%Y")
DDD7

# [1] "2010-12-30" "2011-01-01" "2011-06-01" "2011-06-01" 
#     "2011-10-25" "2011-11-27" "2011-12-28" "2012-01-16"
#     "2012-02-17" "2012-02-17" "2012-02-17"
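The day-of-month check suggested above could be sketched as follows. The helper year_offset and its use_day argument are hypothetical, not part of the code in this answer; the idea is to compare a combined month*100 + day key instead of the month alone.

```r
# Hypothetical helper: count year rollovers in a vector of "m/d" strings.
year_offset <- function(md, use_day = TRUE) {
  parts <- strsplit(md, "/", fixed = TRUE)
  month <- as.integer(sapply(parts, `[`, 1))
  day   <- as.integer(sapply(parts, `[`, 2))
  # with use_day, 3/14 sorts before 3/15; without it both are just month 3
  key <- if (use_day) month * 100L + day else month
  c(0L, cumsum(diff(key) < 0))
}

# A gap of nearly a full year inside the same month:
year_offset(c("3/15", "3/14"), use_day = FALSE)   # 0 0  (rollover missed)
year_offset(c("3/15", "3/14"), use_day = TRUE)    # 0 1  (rollover detected)

# The "1/30" then "3/1" case stays ambiguous either way:
year_offset(c("1/30", "3/1"))                     # 0 0
```

The last call illustrates the limitation described earlier: when the later date has a larger month and day, nothing in the data distinguishes a one-month gap from a thirteen-month gap.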

I'm posting this because, unlike the answers offered so far, it can fairly be called a vectorized approach. I think ifelse is only pseudo-vectorization, since three vectors need to be constructed before one is selected from.

 dat <- read.table(text="2011.2012
 9/7 
 11/5
 12/15
 1/5
 2/5", header=TRUE)

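dat$date1 <- as.Date(dat$X2011.2012, format="%m/%d")
# running count of decreases in the date = number of year rollovers so far
dat$GT <- cumsum(c(FALSE, diff(dat$date1) < 0))
# starting year taken from the column name "X2011.2012"
startyr <- as.numeric(substr(names(dat)[1], 2, 5))
dat$truedate <- paste(format(dat$date1, format="%m/%d"),
                      dat$GT + startyr, sep="-")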
 dat
#-------------------------
  X2011.2012      date1 GT   truedate
1        9/7 2012-09-07  0 09/07-2011
2       11/5 2012-11-05  0 11/05-2011
3      12/15 2012-12-15  0 12/15-2011
4        1/5 2012-01-05  1 01/05-2012
5        2/5 2012-02-05  1 02/05-2012

I thought that reading the first line as a header was perfectly legitimate. If it is not acceptable, the starting year can be supplied some other way, and an approach based on the cumsum of the diff of the dates is still vectorized. It is not restricted to two years and should succeed with as little as one date per year.
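The truedate values printed above are character strings rather than Date objects. A minimal sketch of the final conversion, assuming the "%m/%d-%Y" layout shown in that output:

```r
# The truedate strings as they appear in the printed data frame.
truedate <- c("09/07-2011", "11/05-2011", "12/15-2011",
              "01/05-2012", "02/05-2012")

# Parse month/day-year into Date objects.
dates <- as.Date(truedate, format = "%m/%d-%Y")
dates
# [1] "2011-09-07" "2011-11-05" "2011-12-15" "2012-01-05" "2012-02-05"

# Sanity check: with the years corrected, the dates are strictly increasing.
all(diff(dates) > 0)   # TRUE
```

Because the year is now part of the string being parsed, the result no longer depends on the year in which the code is run.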
