
Change variable chunk of a NetCDF with R

Regularly I face the same problem when using R to work with big netcdf files (bigger than the computer memory). There is not an obvious way to change the chunking of the data. This is probably the only common netcdf task that I cannot figure out how to do efficiently in R. I used to work around this problem with NCO or nccopy, depending on the situation. Even CDO has options to copy a nc while changing the chunking, but it is much less flexible than the previous tools. I am wondering if there is any efficient way to do it in R.

The following example generates a toy nc chunked as Chunking: [100,100,1]

library(ncdf4)

# Toy file: 100 x 100 grid with 1000 time steps, chunked map-wise
foo_nc_path <- paste0(tempdir(), "/thing.nc")
xvals <- 1:100
yvals <- 1:100

lon  <- ncdim_def("longitude", "Km_east", xvals)
lat  <- ncdim_def("latitude", "Km_north", yvals)
time <- ncdim_def("Time", "hours", 1:1000, unlim = TRUE)

var <- ncvar_def("foo_var", "nothing", list(lon, lat, time),
                 chunksizes = c(100, 100, 1),
                 longname = "xy chunked numbers", missval = -9)

foo_nc <- nc_create(foo_nc_path, list(var))

data <- array(runif(100 * 100 * 1000), dim = c(100, 100, 1000))

ncvar_put(foo_nc, var, data)

nc_close(foo_nc)


#### Check speed

foo_nc <- nc_open(foo_nc_path)

# One full map at a single time step: aligned with the chunking, fast
system.time({timestep <- ncvar_get(foo_nc, "foo_var", start = c(1, 1, 1), count = c(-1, -1, 1))})
# One grid cell across all time steps: crosses 1000 chunks, slow
system.time({timeserie <- ncvar_get(foo_nc, "foo_var", start = c(1, 1, 1), count = c(1, 1, -1))})

As you can see, the read time is much larger for timeserie than for timestep.

The time difference grows dramatically with the size of the .nc file.

Does anybody know any way to change the chunking of a nc file in R when the file is bigger than the computer memory?

It depends on your purpose. If you need to extract/analyze "map-wise" slices (i.e. on the lat-lon matrix), then keep the chunking strategy on the spatial coordinates. However, if you wish to run a time-wise analysis (such as extracting the time series of each grid cell to calculate trends), then my advice is to switch your chunking strategy to the time dimension.

Try re-running your code replacing chunksizes=c(100,100,1) with something like, say, chunksizes=c(10,10,1000). The time-series reads become much faster that way.
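For a file that is bigger than memory, the re-chunking can also be done entirely in R by creating a second file with the new chunk sizes and copying the data in slabs that fit in memory. This is only a sketch under assumptions: the variable is called foo_var as in the toy example, the longitude dimension comes first, and a block of 10 rows fits comfortably in memory.

```r
library(ncdf4)

# Sketch: stream an existing file into a new one with time-friendly chunks,
# reading only `block` rows of the grid at a time so memory use stays bounded.
rechunk_nc <- function(in_path, out_path,
                       new_chunks = c(10, 10, 1000), block = 10) {
  nc_in <- nc_open(in_path)
  v     <- nc_in$var[["foo_var"]]          # assumed variable name

  # Reuse the input dimensions; only the on-disk chunking changes
  var_out <- ncvar_def(v$name, v$units, v$dim,
                       chunksizes = new_chunks, missval = v$missval)
  nc_out <- nc_create(out_path, list(var_out))

  nx <- v$dim[[1]]$len                     # length of the first dimension
  for (i in seq(1, nx, by = block)) {
    n    <- min(block, nx - i + 1)
    slab <- ncvar_get(nc_in, v$name,
                      start = c(i, 1, 1), count = c(n, -1, -1))
    ncvar_put(nc_out, var_out, slab,
              start = c(i, 1, 1), count = c(n, -1, -1))
  }
  nc_close(nc_in)
  nc_close(nc_out)
}
```

Note that each slab read still crosses many of the old [100,100,1] chunks, so the copy itself is slow; the payoff is that every later time-series read on the output file is fast.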

If your code is really slow in R you can try a faster alternative, such as (for example) nccopy or NCO.

You can re-chunk your netcdf file using a simple nccopy command like this: nccopy -c time/1000,lat/10,lon/10 input.nc output.chunked.nc

With NCO (which I recommend over nccopy for this operation), you could do something along the lines of:

ncks -O -4 -D 4 --cnk_plc g2d --cnk_dmn lat,10 --cnk_dmn lon,10 --cnk_dmn time,1000 in.nc out.nc

specifying --cnk_dmn for each dimension of interest with the desired chunk size. More examples at http://nco.sourceforge.net/nco.html#Chunking .

Either way, you will have to play around a little with different chunk sizes in order to determine what works best for your specific case.
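While experimenting, it helps to confirm what chunking a file actually ended up with. A minimal check from R, assuming a recent ncdf4 (which fills in the chunksizes field on nc_open) and the file/variable names from the nccopy example above:

```r
library(ncdf4)

# Inspect how a variable is chunked on disk after re-chunking
nc <- nc_open("output.chunked.nc")     # assumed output file name
print(nc$var[["foo_var"]]$chunksizes)  # assumed variable name
nc_close(nc)
```

Re-running the system.time() comparison from the question on the re-chunked file then shows directly whether the new layout pays off for your access pattern.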
