Fast/efficient way to extract data from multiple large NetCDF files

I need to extract data from a global grid only for a specific set of nodes, given by lat/lon coordinates (on the order of 5000-10000 nodes). The data are time series of hydraulic parameters, for example wave height.

The global data set is huge, so it is divided into many NetCDF files. Each NetCDF file is around 5 GB and contains data for the entire global grid, but only for one variable (e.g. wave height) and one year (e.g. 2020). Say I want to extract the full time series (42 years) of 6 variables at a certain location: I need to extract data from 6 x 42 = 252 NC files, each 5 GB in size.

My current approach is a triple loop through years, variables, and nodes. I use Xarray to open each NC file, extract the data for all the required nodes and store it in a dictionary. Once I've extracted all the data into the dictionary, I create one pd.DataFrame for each location, which I store as a pickle file. With 6 variables and 42 years, this results in a pickle file of around 7-9 MB per location (so not very large, actually).

My approach works perfectly fine if I have a small number of locations, but as soon as it grows to a few hundred, it takes extremely long. My gut feeling is that it is a memory problem (since all the extracted data are first stored in a single dictionary, until every year and variable have been extracted). But one of my colleagues said that Xarray is actually quite inefficient and that this might cause the long run times.

Does anyone here have experience with similar issues or know of an efficient way to extract data from a multitude of NC files? I put the code I currently use below. Thanks for any help!

# imports (nodes is assumed to be a pandas DataFrame with columns node_id, lon, lat)
import numpy as np
import xarray as xr

# set conditions
vars = {...dictionary which contains variables}
years = np.arange(y0, y1 + 1)   # year range
ndata = {}                      # dictionary which will contain all data

# loop through all the desired variables
for v in vars.keys():
    ndata[v] = {}

    # For each variable, loop through each year, open the nc file and extract the data
    for y in years:
        
        # Open file with xarray
        fname = 'xxx.nc'
        data = xr.open_dataset(fname)
        
        # loop through the locations and load the data for each node as temp
        for n in range(len(nodes)):
            node = nodes.node_id.iloc[n]
            lon = nodes.lon.iloc[n]
            lat = nodes.lat.iloc[n]    
            
            temp = data.sel(longitude=lon, latitude=lat)
            
            # For the first year, store the data into the ndata dict
            if y == years[0]:
                ndata[v][node] = temp
            # For subsequent years, concatenate the existing array in ndata
            else:
                ndata[v][node] = xr.concat([ndata[v][node],temp], dim='time')

# merge the variables for the current location into one dataset
for n in range(len(nodes)):
    node = nodes.node_id.iloc[n]
    
    dset = xr.merge(ndata[v][node] for v in vars.keys())
    df = dset.to_dataframe()

    # save dataframe as pickle file, named by the node id
    df.to_pickle('%s.xz' % node)

This is a pretty common workflow, so I'll give a few pointers. A few suggested changes, with the most important ones first:

  1. Use xarray's advanced indexing to select all points at once

    It looks like you're using a pandas DataFrame nodes with columns 'lat', 'lon', and 'node_id'. As with nearly everything in Python, remove an inner for loop whenever possible, leveraging array-based operations written in C. In this case:

     # create an xr.Dataset indexed by node_id with arrays `lat` and `lon`
     node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()

     # select all points from each file simultaneously, reshaping to be
     # indexed by `node_id`
     node_data = data.sel(lat=node_indexer.lat, lon=node_indexer.lon)

     # dump this reshaped data to pandas, with each variable becoming a column
     node_df = node_data.to_dataframe()
  2. Only reshape arrays once

    In your code, you are looping over many years, and every year after the first one you are allocating a new array with enough memory to hold as many years as you've stored so far.

     # For the first year, store the data into the ndata dict
     if y == years[0]:
         ndata[v][node] = temp
     # For subsequent years, concatenate the existing array in ndata
     else:
         ndata[v][node] = xr.concat([ndata[v][node], temp], dim='time')

    Instead, just gather all the years' worth of data and concatenate them at the end (see the short sketch after this list). This will only allocate the needed array for all the data once.

  3. Use dask, e.g. with xr.open_mfdataset, to leverage multiple cores. If you do this, you may want to consider using a format that supports multithreaded writes, e.g. zarr.
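As a minimal sketch of points 1 and 2 together (before bringing in dask), assuming the vars, years, and nodes objects and the placeholder file name from the question, and that the dimension names in your files are lat/lon (adjust to latitude/longitude if needed):

# point 1: vectorised selection of all nodes at once
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()

ndata = {}
for v in vars.keys():
    per_year = []                              # gather one selection per year
    for y in years:
        data = xr.open_dataset('xxx.nc')       # placeholder name, as in the question
        per_year.append(data.sel(lat=node_indexer.lat, lon=node_indexer.lon))
    # point 2: a single concat allocates the full array only once
    ndata[v] = xr.concat(per_year, dim='time')

Each ndata[v] is then a dataset indexed by node_id and time, which you can merge across variables (xr.merge) and dump to pandas as before.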

All together, this could look something like this:

# build nested filepaths
filepaths = [
    ['xxx.nc'.format(year=y, variable=v) for y in years]
    for v in variables
]

# build node indexer
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()

# I'm not sure if you have conflicting variable names - you'll need to
# tailor this line to your data setup. It may be that you want to just
# concatenate along years and then use `xr.merge` to combine the
# variables, or just handle one variable at a time
ds = xr.open_mfdataset(
    filepaths,
    combine='nested',
    concat_dim=['variable', 'year'],
    parallel=True,
)

# this will only schedule the operation - no work is done until the next line
ds_nodes = ds.sel(lat=node_indexer.lat, lon=node_indexer.lon)

# this triggers the operation using a dask LocalCluster, leveraging
# multiple threads on your machine (or a distributed Client if you have
# one set up)
ds_nodes.to_zarr('all_the_data.zarr')

# alternatively, you could still dump to pandas:
df = ds_nodes.to_dataframe()
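
And if you still want one pickle file per node, as in your current workflow, you could split the combined frame afterwards. A minimal sketch, assuming to_dataframe() leaves 'node_id' as a level of the resulting index:

# write one pickle per node, named by node id, as in the original script
for node_id, node_df in df.groupby(level='node_id'):
    node_df.to_pickle('%s.xz' % node_id)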
