
What's the most efficient way to use movingFun on a large raster time series?

I have to smooth a large time series and I'm using the movingFun function from the 'raster' package. I tested a few options based on previous posts (see my options below). The first two work, but are very slow on the real data (the full MOD13Q1 time series for all of Australia). So I attempted option (c) and failed. I'd appreciate it if someone could point out what's wrong with that function. Memory is not the constraint: I'm using an RStudio Server with 700 GB of RAM. Still, I'm not sure what the best approach to this job would be. Thanks in advance.

a) using movingFun and overlay

library(raster)
r <- raster(ncol=10, nrow=10)
r[] <- runif(ncell(r))
s <- brick(r,r*r,r+2,r^5,r*3,r*5)
ptm <- proc.time()
v <- overlay(s, fun=function(x) movingFun(x, fun=mean, n=3, na.rm=TRUE, circular=TRUE)) #works
proc.time() - ptm

   user  system elapsed 
  0.140   0.016   0.982

b) creating a function and using clusterR. I thought this would be faster than (a).

fun1 = function(x) {overlay(x, fun=function(x) movingFun(x, fun=mean, n=6, na.rm=TRUE, circular=TRUE))}

beginCluster(4)
ptm <- proc.time()
v = clusterR(s, fun1, progress = "text")
proc.time() - ptm
endCluster()
   user  system elapsed 
  0.124   0.012   4.069 

c) I found this document written by Robert J. Hijmans and tried (and failed) to write a function as described in the vignette. I can't fully follow all the steps in that function, which is probably why it's failing.

smooth.fun <- function(x, filename='', smooth_n ='',...) { #x could be a stack or list of rasters
  out <- brick(x)
  big <- ! canProcessInMemory(out, 3)
  filename <- trim(filename)
  if (big & filename == '') {
    filename <- rasterTmpFile()
  }
  if (filename != '') {
    out <- writeStart(out, filename, ...)
    todisk <- TRUE
  } else {
    vv <- matrix(ncol=nrow(out), nrow=ncol(out))
    todisk <- FALSE
  }

  bs <- blockSize(out)
  pb <- pbCreate(bs$n)

  if (todisk) {
    for (i in 1:bs$n) {
      v <- getValues(out, row=bs$row[i], nrows=bs$nrows[i] )
      v <- movingFun(v, fun=mean, n=smooth_n, na.rm=TRUE, circular=TRUE)
      out <- writeValues(out, v, bs$row[i])
      pbStep(pb, i)
    }
    out <- writeStop(out)
  } else {
    for (i in 1:bs$n) {
      v <- getValues(out, row=bs$row[i], nrows=bs$nrows[i] )
      v <- movingFun(v, fun=mean, n=smooth_n, na.rm=TRUE, circular=TRUE)
      cols <- bs$row[i]:(bs$row[i]+bs$nrows[i]-1)
      vv[,cols] <- matrix(v, nrow=out@ncols)
      pbStep(pb, i)
    }
    out <- setValues(out, as.vector(vv))
  }
  pbClose(pb)
  return(out)
}

s <- smooth.fun(s, filename='test.tif', smooth_n = 6, format='GTiff', overwrite=TRUE)

 Error in .local(.Object, ...) : 
  `/path-to-dir/test.tif' does not exist in the file system,
and is not recognised as a supported dataset name.
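
In hindsight, two things in smooth.fun look like the likely culprits. First, the loop reads with getValues(out, ...), but out is an empty brick created by brick(x), and after writeStart it points at the not-yet-written test.tif, hence the "does not exist" error; the values should be read from the input x instead. Second, getValues on a multi-layer object returns a cells-by-layers matrix, while movingFun expects a vector, so the smoothing has to be applied to each row (each cell's time series). Below is an untested sketch of a corrected version along those lines (same structure as the vignette; smooth.block is just a local helper name I made up):

smooth.fun <- function(x, filename='', smooth_n=3, ...) { # x is a stack or brick
  out <- brick(x, values=FALSE)
  filename <- trim(filename)
  if (!canProcessInMemory(out, 3) & filename == '') {
    filename <- rasterTmpFile()
  }
  todisk <- filename != ''
  bs <- blockSize(out)
  pb <- pbCreate(bs$n)
  # smooth each cell's time series, i.e. each row of the
  # cells-by-layers matrix that getValues returns for a block
  smooth.block <- function(v) {
    t(apply(v, 1, movingFun, n=smooth_n, fun=mean, na.rm=TRUE, circular=TRUE))
  }
  if (todisk) {
    out <- writeStart(out, filename, ...)
    for (i in 1:bs$n) {
      v <- getValues(x, row=bs$row[i], nrows=bs$nrows[i])  # read the input, not `out`
      out <- writeValues(out, smooth.block(v), bs$row[i])
      pbStep(pb, i)
    }
    out <- writeStop(out)
  } else {
    vv <- matrix(nrow=ncell(out), ncol=nlayers(out))
    for (i in 1:bs$n) {
      v <- getValues(x, row=bs$row[i], nrows=bs$nrows[i])
      cells <- cellFromRow(out, bs$row[i]:(bs$row[i] + bs$nrows[i] - 1))
      vv[cells, ] <- smooth.block(v)
      pbStep(pb, i)
    }
    out <- setValues(out, vv)
  }
  pbClose(pb)
  return(out)
}

With those two changes, the original call smooth.fun(s, filename='test.tif', smooth_n=6, format='GTiff', overwrite=TRUE) should, in principle, work.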

In the end, this is the solution I went with, thanks to my colleague. It processes each year (23 files) in about 20 minutes. There may be things to improve, but at this stage I'm happy I can do the job in only 20 minutes per year.

So here I run 5 years simultaneously using the foreach package. The inner for loop maintains an array holding 6 files at a time; remember that I needed a 3-month moving window, which in the 16-day MOD13Q1 dataset means 6 files. The loop then computes mean values over the array with colMeans, creates an empty raster, assigns the means to it, and saves it. Note that we recreated the circular option of the movingFun function, so the mean for the first date is based on the last dates of that same year; the index arithmetic is illustrated below.
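
To see how the modular arithmetic reproduces the wrap-around, here is a quick illustration using the values from the script below (k = 6 files per window, 23 MOD13Q1 files per year). The negative values of n pre-load the window with the last files of the year before any output is written (writing only starts once n > 0):

k <- 6                            # moving window size, in files
k2 <- floor((k - 1) / 2)          # half-window offset
nfiles <- 23                      # MOD13Q1 files in one year
n <- (1 - (k - 1)):nfiles         # loop range used in the script
i <- (n + k2 - 1) %% nfiles + 1   # index of the file actually read
head(cbind(n, i), 6)
#      n  i
# [1,] -4 21
# [2,] -3 22
# [3,] -2 23
# [4,] -1  1
# [5,]  0  2
# [6,]  1  3

The full script: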

require(raster)
require(rgdal)
library(foreach)
library(doParallel)

rasterOptions(maxmemory = 3e10, chunksize = 2e10)

ip <- "directory-with-grids"
op <- "directory-where-resuls-are-being-saved"

years <- 2000:2017

k <- 6    # moving window size
k2 <- floor((k-1)/2)
slot <- 0

# determine clusters
cl <- makeCluster(5, outfile = "") # make worker prints visible
registerDoParallel(cl)

foreach(j = 1:length(years), .packages=c("raster")) %dopar% {
  ip1 = paste(ip, years[j],sep='/')
  ndvi.files <- list.files(ip1, pattern = 'ndvi.*tif$',full.names = T) 
  nfiles <- length(ndvi.files)

  for (n in (1-(k-1)):nfiles) {
    i <- (n + k2 - 1) %% nfiles + 1
    print(ndvi.files[i])
    r <- raster(ndvi.files[i])
    if (slot == 0) {
      win <- matrix(data = NA, nrow = k, ncol = r@nrows * r@ncols)
    }
    slot <- slot %% k + 1
    win[slot,] <- getValues(r)
    if (n > 0) {
    o <- raster(extent(c(xx, xx, xx, xx))); res(o) <- c(xx, xx) # your extent and resolution
    crs(o) <- '+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0'
      o[] <- colMeans(win)
      o[o<0] <- NA
      # write out o as the nth smoothed raster
      fname = paste(names(r),'smoothed',sep='_')
      out.dir =  file.path(op, paste(years[j], sep='/'))
      dir.create(out.dir,showWarnings = FALSE)
      out.path = file.path(out.dir, fname)
      writeRaster(o, out.path, format="GTiff", overwrite=TRUE, datatype='INT2S')
    }
  }
}

stopCluster(cl)
