Fast/efficient way to extract data from multiple large NetCDF files
I need to extract data from a global grid for a specific set of nodes given by lat/lon coordinates (on the order of 5,000-10,000 nodes). The data are time series of hydraulic parameters, for example wave height.
The global data set is huge, so it is divided into many NetCDF files. Each NetCDF file is around 5 GB and contains data for the entire global grid, but only for one variable (e.g. wave height) and one year (e.g. 2020). Say I want to extract the full time series (42 years) of 6 variables at a certain location: I need to extract data from 6 x 42 = 252 NC files, each 5 GB in size.
My current approach is a triple loop through years, variables, and nodes. I use xarray to open each NC file, extract the data for all the required nodes, and store it in a dictionary. Once I've extracted all the data into the dictionary, I create one pd.DataFrame for each location, which I store as a pickle file. With 6 variables and 42 years, this results in a pickle file of around 7-9 MB for each location (so not very large, actually).
My approach works perfectly fine if I have a small number of locations, but as soon as that grows to a few hundred, it takes extremely long. My gut feeling is that it is a memory problem (since all the extracted data is first held in a single dictionary until every year and variable has been extracted). But one of my colleagues said that xarray is actually quite inefficient and that this might be the cause of the long runtime.
Does anyone here have experience with similar issues, or know an efficient way to extract data from a multitude of NC files? I put the code I currently use below. Thanks for any help!
```python
import numpy as np
import xarray as xr

# set conditions
vars = {...dictionary which contains variables}
years = np.arange(y0, y1 + 1)  # year range
ndata = {}  # dictionary which will contain all data

# loop through all the desired variables
for v in vars.keys():
    ndata[v] = {}
    # For each variable, loop through each year, open the nc file and extract the data
    for y in years:
        # Open file with xarray
        fname = 'xxx.nc'
        data = xr.open_dataset(fname)
        # loop through the locations and load the data for each node as temp
        for n in range(len(nodes)):
            node = nodes.node_id.iloc[n]
            lon = nodes.lon.iloc[n]
            lat = nodes.lat.iloc[n]
            temp = data.sel(longitude=lon, latitude=lat)
            # For the first year, store the data into the ndata dict
            if y == years[0]:
                ndata[v][node] = temp
            # For subsequent years, concatenate onto the existing array in ndata
            else:
                ndata[v][node] = xr.concat([ndata[v][node], temp], dim='time')

# merge the variables for each location into one dataset
for n in range(len(nodes)):
    node = nodes.node_id.iloc[n]
    dset = xr.merge(ndata[v][node] for v in vars.keys())
    df = dset.to_dataframe()
    # save dataframe as pickle file, named by the node id
    df.to_pickle('%s.xz' % node)
```
This is a pretty common workflow, so I'll give a few pointers. A few suggested changes, with the most important ones first:
Use xarray's advanced indexing to select all points at once

It looks like you're using a pandas DataFrame nodes with columns 'lat', 'lon', and 'node_id'. As with nearly everything in python, remove an inner for loop whenever possible, leveraging array-based operations written in C. In this case:
```python
# create an xr.Dataset indexed by node_id, with arrays `lat` and `lon`
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()

# select all points from each file simultaneously, reshaping to be
# indexed by `node_id`
node_data = data.sel(lat=node_indexer.lat, lon=node_indexer.lon)

# dump this reshaped data to pandas, with each variable becoming a column
node_df = node_data.to_dataframe()
```
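To see what this pointwise selection does, here is a self-contained sketch on a tiny synthetic grid (the dataset `ds`, the node table, and all coordinate values are made up for illustration). Passing DataArray indexers to `.sel` selects one grid cell per node, rather than the full lat x lon cross product:

```python
import numpy as np
import pandas as pd
import xarray as xr

# toy global grid: 3 time steps, 4 latitudes, 5 longitudes
ds = xr.Dataset(
    {"wave_height": (("time", "lat", "lon"), np.arange(60.0).reshape(3, 4, 5))},
    coords={
        "time": pd.date_range("2020-01-01", periods=3),
        "lat": [-10.0, 0.0, 10.0, 20.0],
        "lon": [0.0, 90.0, 180.0, 270.0, 350.0],
    },
)

# node table with the same columns as in the question
nodes = pd.DataFrame(
    {"node_id": [101, 102, 103], "lat": [0.0, 10.0, 20.0], "lon": [90.0, 0.0, 350.0]}
)

# DataArray indexers trigger vectorized (pointwise) selection: the lat and
# lon dimensions are replaced by a single node_id dimension
indexer = nodes.set_index("node_id")[["lat", "lon"]].to_xarray()
picked = ds.sel(lat=indexer.lat, lon=indexer.lon)

# one value per (time, node) pair, no inner python loop
assert set(picked.wave_height.dims) == {"time", "node_id"}
assert picked.sizes["node_id"] == 3
```

Node 101 sits at lat index 1, lon index 1 of the grid, so its first-timestep value matches a direct positional lookup, which is a quick way to convince yourself the reshaping is correct.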
Only reshape arrays once

In your code, you are looping over many years, and on every year after the first you allocate a new array with enough memory to hold as many years as you've stored so far:

```python
# For the first year, store the data into the ndata dict
if y == years[0]:
    ndata[v][node] = temp
# For subsequent years, concatenate onto the existing array in ndata
else:
    ndata[v][node] = xr.concat([ndata[v][node], temp], dim='time')
```

Instead, just gather all the years' worth of data and concatenate them at the end. This allocates the array needed to hold all the data only once.
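A minimal numpy sketch of the two patterns (the same idea applies to xarray: append each year's `temp` to a list inside the loop, then call `xr.concat(pieces, dim='time')` once after it):

```python
import numpy as np

years = range(5)

# pattern to avoid: re-allocate a growing array on every iteration,
# copying everything stored so far each time
grown = np.empty((0, 3))
for y in years:
    chunk = np.full((2, 3), y, dtype=float)  # stand-in for one year's data
    grown = np.concatenate([grown, chunk], axis=0)

# preferred: collect the pieces, then concatenate once at the end
pieces = [np.full((2, 3), y, dtype=float) for y in years]
once = np.concatenate(pieces, axis=0)

assert np.array_equal(grown, once)
print(once.shape)  # (10, 3)
```

Both produce identical results; the second does one allocation instead of one per year, which matters once each chunk is large.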
Use dask, e.g. with xr.open_mfdataset, to leverage multiple cores

If you do this, you may want to consider using an output format that supports multithreaded writes, e.g. zarr.
All together, this could look something like this:

```python
# build nested filepaths
filepaths = [
    ['xxx.nc'.format(year=y, variable=v) for y in years]
    for v in variables
]

# build node indexer
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()

# I'm not sure if you have conflicting variable names - you'll need to
# tailor this call to your data setup. It may be that you want to just
# concatenate along years and then use `xr.merge` to combine the
# variables, or just handle one variable at a time
ds = xr.open_mfdataset(
    filepaths,
    combine='nested',
    concat_dim=['variable', 'year'],
    parallel=True,
)

# this only schedules the operation - no work is done until the write below
ds_nodes = ds.sel(lat=node_indexer.lat, lon=node_indexer.lon)

# this triggers the computation using a dask LocalCluster, leveraging
# multiple threads on your machine (or a distributed Client if you have
# one set up)
ds_nodes.to_zarr('all_the_data.zarr')

# alternatively, you could still dump to pandas:
df = ds_nodes.to_dataframe()
```