
Change variable chunking of a NetCDF file with R

Regularly I face the same problem when using R to work with big netCDF files (bigger than the computer memory): there is no obvious way to change the chunking of the data. This is probably the only common netCDF task that I cannot figure out how to do efficiently in R. I usually work around the problem with NCO or nccopy, depending on the situation. Even CDO has options to copy a .nc file while changing the chunking, but it is much less flexible than the previous tools. I am wondering if there is any efficient way to do it in R.

The following example generates a toy .nc file with chunking [100,100,1]:

library(ncdf4)

foo_nc_path <- paste0(tempdir(), "/thing.nc")
xvals <- 1:100
yvals <- 1:100

lon <- ncdim_def("longitude", "Km_east", xvals)
lat <- ncdim_def("latitude", "Km_north", yvals)

time <- ncdim_def("Time","hours", 1:1000, unlim=TRUE)
var <- ncvar_def("foo_var", "nothing", list(lon, lat, time), chunksizes = c(100, 100, 1),
                 longname = "xy chunked numbers", missval = -9)

foo_nc <- nc_create(foo_nc_path, list(var))

data <- array(runif(100*100*1000),dim = c(100,100,1000))

ncvar_put(foo_nc, var, data)

nc_close(foo_nc)


#### Check speed

foo_nc <- nc_open(foo_nc_path)

system.time({timestep <- ncvar_get(foo_nc,"foo_var",start = c(1,1,1),count=c(-1,-1,1))})
system.time({timeserie <- ncvar_get(foo_nc,"foo_var",start = c(1,1,1),count=c(1,1,-1))})

As you can see, the read time is much longer for the time series than for the time step variable.

The time difference increases dramatically with the size of the .nc file.

Does anybody know a way to change the chunking of a .nc file in R when the file is bigger than the computer's memory?

It depends on your purpose. If you need to extract/analyze "map-wise" slices (i.e. on the lat-lon matrix), then keep the chunking strategy on the spatial coordinates. However, if you wish to run a time-wise analysis (such as extracting the time series of each grid cell to calculate trends), then my advice is to switch your chunking strategy to the time dimension.

Try re-running your code, replacing chunksizes=c(100,100,1) with something like, say, chunksizes=c(10,10,1000). Reading the time series becomes much faster that way.
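For a file that no longer fits in memory, the same idea can be applied by copying the existing file into a new one in slabs, so that only a small block is held in memory at a time. The following is a minimal sketch along those lines, assuming the toy file created above and an arbitrary tile size of 10; the tiled reads will still be slow (they are limited by the original chunking), but memory use stays bounded.

library(ncdf4)

in_nc    <- nc_open(foo_nc_path)
out_path <- paste0(tempdir(), "/thing_rechunked.nc")

# Re-declare the dimensions and the variable, this time chunked along time
lon  <- ncdim_def("longitude", "Km_east",  1:100)
lat  <- ncdim_def("latitude",  "Km_north", 1:100)
time <- ncdim_def("Time", "hours", 1:1000, unlim = TRUE)
var_out <- ncvar_def("foo_var", "nothing", list(lon, lat, time),
                     chunksizes = c(10, 10, 1000),
                     longname = "xy chunked numbers", missval = -9)

# netCDF-4 format is required for chunking
out_nc <- nc_create(out_path, list(var_out), force_v4 = TRUE)

# Copy the data in 10x10 spatial tiles (full time range each),
# so each write fills exactly one chunk of the output file
tile <- 10
for (i in seq(1, 100, by = tile)) {
  for (j in seq(1, 100, by = tile)) {
    block <- ncvar_get(in_nc, "foo_var",
                       start = c(i, j, 1), count = c(tile, tile, -1))
    ncvar_put(out_nc, var_out, block,
              start = c(i, j, 1), count = c(tile, tile, 1000))
  }
}

nc_close(out_nc)
nc_close(in_nc)

Whether this is fast enough depends on how costly the tiled reads from the original layout are; in practice the command-line tools mentioned below are usually quicker.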

If your code is really slow in R, you can try a faster alternative, such as nccopy or NCO.

You can re-chunk your netCDF file with a simple nccopy command like this (adapting the dimension names to your file; in the toy example above they are Time, latitude and longitude): nccopy -c time/1000,lat/10,lon/10 input.nc output.chunked.nc

With NCO (which I recommend over nccopy for this operation), you could do something along the lines of:

ncks -O -4 -D 4 --cnk_plc g2d --cnk_dmn lat,10 --cnk_dmn lon,10 --cnk_dmn time,1000 in.nc out.nc

specifying --cnk_dmn for each dimension with the chunk size of interest. More examples at http://nco.sourceforge.net/nco.html#Chunking .
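If you want to drive this from R anyway, you can simply shell out to one of these tools. A minimal sketch, assuming nccopy is installed and on the PATH, and using the dimension names of the toy file above (Time, latitude, longitude):

in_path  <- foo_nc_path
out_path <- paste0(tempdir(), "/thing_nccopy.nc")

# Call nccopy from R; the chunk sizes are just the ones suggested above
status <- system2("nccopy",
                  args = c("-c", "Time/1000,latitude/10,longitude/10",
                           in_path, out_path))
if (status != 0) stop("nccopy failed with exit code ", status)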

Either way, you have to play around a little bit with different chunk sizes in order to determine what works best for your specific case.
