简体   繁体   中英

R xts - Resampling unequal time step xts to equidistant time series

I'm working in R with xts time series.

What I have: A time series data set with unequally spaced time steps.

What I'd like to get: A time series with equally spaced time steps whose values correspond to the proportion of the original values overlapping the time step (see example below).

Example: With an original series like this:

sample_xts <- as.xts(read.zoo(text='
2016-07-01 00:00:20,   0.0
2016-07-01 00:01:20,  60.0
2016-07-01 00:01:50,  30.0
2016-07-01 00:02:30,  40.0
2016-07-01 00:04:20, 110.0
2016-07-01 00:05:30, 140.0
2016-07-01 00:06:00,  97.0
2016-07-01 00:07:12, 144.0
2016-07-01 00:08:09,   0.0
', sep=',', index=1, tz='', format="%Y-%m-%d %H:%M:%S"))
names(sample_xts) <- c('x')

I'd like to get an equally spaced time series that looks like this:

                         x
2016-07-01 00:00:00,   0.0
2016-07-01 00:01:00,  40.0
2016-07-01 00:02:00,  60.0
2016-07-01 00:03:00,  60.0
2016-07-01 00:04:00,  60.0
2016-07-01 00:05:00, 100.0
2016-07-01 00:06:00, 157.0
2016-07-01 00:07:00, 120.0
2016-07-01 00:08:00,  24.0
2016-07-01 00:09:00,   0.0

Note:

  • Some original time steps are smaller than the new time step while others are larger.
  • The colSums of x is left unchanged (ie 621).

Here is the sketch I used to create the above example (may help to illustrate what I'd like to do): 重新采样的插图

I'd like an approach that isn't limited to creating a 1 minute time step series but generally to any fixed time step.

I have looked at many q/a on stackoverflow and tried out many different things but without success.

Any help would be greatly appreciated! Thanks.

Here is some code I wrote using zoo - I haven't used xts much so I don't know if the same functions can be applied. Hope that helps!

Functions

The following function calculates, for each interval of the original data, the fraction that overlaps with a given interval (Note: In all of the following code, the variable names ta1 and ta2 refer to the start and end of a given time interval (eg each of the equal intervals that you need as output), while tb1 and tb2 refer to the start and end of the (unequal) intervals of the original data):

frac.overlap <- function(ta1,ta2,tb1,tb2){
if(tb1 <= ta1 & tb2 >= ta2) {   # Interval 2 starts earlier and ends later than interval 1
    frac <- as.numeric(difftime(ta2,ta1,units="secs"))/as.numeric(difftime(tb2,tb1,units="secs"))
} else if(tb1 >= ta1 & tb2 <= ta2) {    # Interval 2 is fully contained within interval 1
    frac <- 1
} else if(tb1 <= ta1 & tb2 >= ta1) {    # Interval 2 partly overlaps with interval 1 (starts earlier, ends earlier)
    frac <- as.numeric(difftime(tb2,ta1,units="secs"))/as.numeric(difftime(tb2,tb1,units="secs"))
} else if (tb1 <= ta2 & tb2 >= ta2){    # Interval 2 partly overlaps with interval 1 (starts later, ends later)
    frac <- as.numeric(difftime(ta2,tb1,units="secs"))/as.numeric(difftime(tb2,tb1,units="secs"))
        } else {                                # No overlap
            frac <- 0
    }

    return(frac)
}

The next function determines which records of the original dataset overlap with the currently considered interval ta1 - ta2 :

check.overlap <- function(ta1,ta2,tb1,tb2){
ov <- vector("logical",4)
ov[1] <- (tb1 <= ta1 & tb2 >= ta2)  # Interval 2 starts earlier and ends later than interval 1
ov[2] <- (tb1 >= ta1 & tb2 <= ta2)  # Interval 2 is fully contained within interval 1
ov[3] <- (tb1 <= ta1 & tb2 >= ta1)  # Interval 2 partly overlaps with interval 1 (starts earlier, ends earlier)
ov[4] <- (tb1 <= ta2 & tb2 >= ta2)  # Interval 2 partly overlaps with interval 1 (starts later, ends later)
return(as.logical(sum(ov))) # Gives TRUE if at least one element of ov is TRUE, otherwise FALSE
}

(Note: this works well with the sample data that you provided, but on a larger dataset, I found it to be prohibitively slow. Since I wrote this code to resample time series with a regular time step, I usually use a fixed interval to complete this step, which is dramatically faster. It's probably easy enough to modify the code (see code of the next function) to speed up this step based on the intervals of the original data.)

The next function uses the previous two to calculate the resampled value for an interval ta1 - ta2 :

fracres <- function(tstart,interval,input){
# tstart: POSIX object
# interval: length of interval in seconds
# input: zoo object

ta1 <- tstart
ta2 <- tstart + interval

# First, determine which records of the original data (input) overlap with the current
# interval, to avoid going through the whole object at every iteration
ind <- index(input)
ind1 <- index(lag(input,-1))
recs <- which(sapply(1:length(ind),function(x) check.overlap(ta1,ta2,ind[x],ind1[x])))
#recs <- which(abs(as.numeric(difftime(ind,ta1,units="secs"))) < 601)


# For each record overlapping with the current interval, return the fraction of the input data interval contained in the current interval
if(length(recs) > 0){
    fracs <- sapply(1:length(recs), function(x) frac.overlap(ta1,ta2,ind[recs[x]],ind1[recs[x]]))
    return(sum(coredata(input)[recs]*fracs))

} else {
    return(0)
}
}

(The commented-out line shows how to get the relevant records if the maximum time difference between the original and new time steps is known.)

Application

First, let's read in your sample data as a zoo object:

sample_zoo <- read.zoo(text='
2016-07-01 00:00:20,   0.0
2016-07-01 00:01:20,  60.0
2016-07-01 00:01:50,  30.0
2016-07-01 00:02:30,  40.0
2016-07-01 00:04:20, 110.0
2016-07-01 00:05:30, 140.0
2016-07-01 00:06:00,  97.0
2016-07-01 00:07:12, 144.0
2016-07-01 00:08:09,   0.0
', sep=',', index=1, tz='', format="%Y-%m-%d %H:%M:%S")

It looks like your dataset contains instantaneous values ("at 01:20 , the value of x is 60"). Since I wrote this code for summed values, the meaning of the time stamp is different ("the record starting at 01:20 has a value of 60"). To correct for this, the records need to be shifted:

sample_zoo <- lag(sample_zoo,1)

Then, we define a sequence of POSIXct objects corresponding to the desired resolution:

time.out <- seq.POSIXt(from=as.POSIXct("2016-07-01"),to=(as.POSIXct("2016-07-01")+(60*9)),by="1 min")

We can then apply the function fracres , described above:

data.out <- sapply(1:length(time.out), function(x) fracres(tstart=time.out[x],interval=60,input=sample_zoo))

The index and data are combined to a zoo object:

zoo.out <- read.zoo(data.frame(time.out,data.out))

And finally, the time series is shifted again by one step, in the opposite direction as before:

zoo.out <- lag(zoo.out,-1)

2016-07-01 00:01:00 2016-07-01 00:02:00 2016-07-01 00:03:00 2016-07-01 00:04:00 2016-07-01 00:05:00 2016-07-01 00:06:00 2016-07-01 00:07:00 2016-07-01 00:08:00 2016-07-01 00:09:00 
             40                  60                  60                  60                 100                 157                 120                  24                   0 

I finally decided to go the "while-loop-way" with this and have created the solution below. It works well - is not super fast but execution time seems to be proportional to the length of the time series. I tested it both with the little example I posted in the question and with a source time series having 330 000 observations and a destination series of about 110 000 time steps.

Both source and destination series can have irregular time steps. The sum of the resulting series is the same as that of the source.

Performance: It is ok fast but I'm sure it could be faster. I guess it's an obvious candidate for a RCpp version which should be significantly faster for long series. For now this will do for me and if/when I get round to creating an RCpp version I'll post here.

Please post if you have suggestions for performance improvement!

Thanks!

sameEndTime <- function(i,j,src_index,dest_index){
  if(src_index[i] == dest_index[j]){
    TRUE
  } else {
    FALSE
  }
}

wholeSourceStepIsWithinDestStep <- function(i,j,src_index,dest_index){
  if(dest_index[j-1] <= src_index[i-1] & src_index[i] <= dest_index[j]){
    TRUE
  } else {
    FALSE
  }
}

wholeDestStepIsWithinSourceStep <- function(i,j,src_index,dest_index){
  if(src_index[i-1] <= dest_index[j-1]  &  dest_index[j] <= src_index[i]){
    TRUE
  } else {
    FALSE
  }
}

onlyEndOfSourceStepIsWithinDestStep <- function(i,j,src_index,dest_index){
  if(src_index[i-1] < dest_index[j-1]  &  src_index[i] < dest_index[j] & src_index[i] > dest_index[j-1]){
    TRUE
  } else {
    FALSE
  }
}

onlyStartOfSourceStepIsWithinDestStep <- function(i,j,src_index,dest_index){
  if(src_index[i-1] < dest_index[j]  &  src_index[i-1] > dest_index[j-1] & src_index[i] > dest_index[j]){
    TRUE
  } else {
    FALSE
  }
}

resampleToDestTimeSteps <- function(src, dest){
  # src and dest are both xts with only one time series each
  # src is the original series and 
  # dest holds the time steps of the final series
  #
  # NB: there is an issue with the very first time step 
  # (gets ignored in this version)
  #
  original_names <- names(src)
  names(src) <- c("value")
  names(dest) <- c("value")
  dest$value <- dest$value*0.0
  dest$value[is.na(dest$value)] <- 0.0

  dest[1]$value = 0.0

  for(k in 2:length(src)){
    src[k]$value <- src[k]$value/as.numeric(difftime(index(src[k]),index(src[k-1]),units="secs"))
  }
  # First value is NA due to lag at this point (we don't want that)
  src$value[1] = 0.0

  i = 2 # source timestep counter
  j = 2 # destination timestep counter

  src_index = index(src)
  dest_index = index(dest)

  src_length = length(src)
  dest_length = length(dest)

  # Make sure we start with an overlap
  if(src_index[2] < dest_index[1]){
    while(src_index[i] < dest_index[1]){
      i = i + 1
    }
  } else if(dest_index[2] < src_index[1]){
    while(dest_index[j] < src_index[1]){
      j = j + 1
    }
  }

  while(i <= src_length & j <= dest_length){
    if( wholeSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
      dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(src_index[i],src_index[i-1],units="secs"))
      if(sameEndTime(i,j,src_index,dest_index)){
        j = j+1
      }
      i = i+1
    } else if( wholeDestStepIsWithinSourceStep(i,j,src_index,dest_index) ){
      dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(dest_index[j],dest_index[j-1],units="secs"))
      if(sameEndTime(i,j,src_index,dest_index)){
        i = i+1
      }
      j = j+1
    } else if( onlyEndOfSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
      dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(src_index[i],dest_index[j-1],units="secs"))
      i = i+1
    } else if( onlyStartOfSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
      diff_time = difftime(dest_index[j],src_index[i-1],units="secs")
      dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(diff_time)
      j = j+1
    } else {
      print("======================================================")
      print(paste0("i=",i,", j=",j))
      print(paste0("src_index[i]   =",src_index[i]))
      print(paste0("dest_index[j]  =",dest_index[j]))
      print(" ")
      print(paste0("src_index[i-1] =",src_index[i-1]))
      print(paste0("dest_index[j-1]=",dest_index[j-1]))
      print("======================================================")
      stop("This should never happen.")
    }
  }
  names(dest) <- original_names
  return(dest)
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM