
Fast/efficient way to extract data from multiple large NetCDF files

I need to extract data from a global grid only for a specific set of nodes, given by lat/lon coordinates (on the order of 5,000-10,000 nodes). The data are time series of hydraulic parameters, for example wave height.

The global data set is huge, so it is divided into many NetCDF files. Each NetCDF file is around 5 GB and contains data for the entire global grid, but only for one variable (e.g. wave height) and one year (e.g. 2020). If I want to extract the full time series (42 years) of 6 variables at a certain location, I need to extract data from 6 x 42 = 252 NC files, each 5 GB in size.

My current approach is a triple loop through years, variables, and nodes. I use xarray to open each NC file, extract the data for all the required nodes and store it in a dictionary. Once I've extracted all the data into the dictionary, I create one pd.DataFrame for each location, which I store as a pickle file. With 6 variables and 42 years, this results in a pickle file of around 7-9 MB for each location (so not very large, actually).

My approach works perfectly fine when I have a small number of locations, but as soon as it grows to a few hundred, it takes extremely long. My gut feeling is that it is a memory problem (since all the extracted data is first stored in a single dictionary until every year and variable have been extracted). But one of my colleagues said that xarray is actually quite inefficient and that this might be the cause of the long runtime.

Does anyone here have experience with similar issues or know of an efficient way to extract data from a multitude of NC files? I put the code I currently use below. Thanks for any help!

# imports
import numpy as np
import xarray as xr

# set conditions
vars = {...}                    # dictionary which contains the variables
years = np.arange(y0, y1 + 1)   # year range
ndata = {}                      # dictionary which will contain all data

# loop through all the desired variables
for v in vars.keys():
    ndata[v] = {}

    # For each variable, loop through each year, open the nc file and extract the data
    for y in years:
        
        # Open file with xarray
        fname = 'xxx.nc'
        data = xr.open_dataset(fname)
        
        # loop through the locations and load the data for each node as temp
        for n in range(len(nodes)):
            node = nodes.node_id.iloc[n]
            lon = nodes.lon.iloc[n]
            lat = nodes.lat.iloc[n]    
            
            temp = data.sel(longitude=lon, latitude=lat)
            
            # For the first year, store the data into the ndata dict
            if y == years[0]:
                ndata[v][node] = temp
            # For subsequent years, concatenate the existing array in ndata
            else:
                ndata[v][node] = xr.concat([ndata[v][node],temp], dim='time')

# merge the variables for the current location into one dataset
for n in range(len(nodes)):
    node = nodes.node_id.iloc[n]
    
    dset = xr.merge(ndata[v][node] for v in vars.keys())
    df = dset.to_dataframe()

    # save dataframe as pickle file, named by the node id
    df.to_pickle('%s.xz' % node)

This is a pretty common workflow, so I'll give a few pointers. A few suggested changes, with the most important ones first:

  1. Use xarray's advanced indexing to select all points at once

    It looks like you're using a pandas DataFrame nodes with columns 'lat', 'lon', and 'node_id'. As with nearly everything in Python, remove inner for loops whenever possible and leverage array-based operations written in C. In this case:

     # create an xr.Dataset indexed by node_id with arrays `lat` and `lon`
     node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()

     # select all points from each file simultaneously, reshaping to be
     # indexed by `node_id`
     node_data = data.sel(lat=node_indexer.lat, lon=node_indexer.lon)

     # dump this reshaped data to pandas, with each variable becoming a column
     node_df = node_data.to_dataframe()
  2. Only reshape arrays once

    In your code, you are looping over many years, and every year after the first one you are allocating a new array with enough memory to hold as many years as you've stored so far.

     # For the first year, store the data into the ndata dict
     if y == years[0]:
         ndata[v][node] = temp
     # For subsequent years, concatenate the existing array in ndata
     else:
         ndata[v][node] = xr.concat([ndata[v][node], temp], dim='time')

    Instead, just gather all the years' worth of data in a list and concatenate them once at the end. That way the array holding all the data is only allocated once (see the sketch after this list).

  3. Use dask, e.g. with xr.open_mfdataset, to leverage multiple cores. If you do this, you may want to consider using an output format that supports multithreaded writes, e.g. zarr.
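
To illustrate point 2, here is a minimal sketch of the gather-then-concatenate pattern, reusing the names from your snippet (fname stays a placeholder, exactly as in your code). It's meant as an illustration under those assumptions, not a drop-in replacement:

# gather each year's slice in a plain list instead of concatenating every iteration
ndata[v] = {node: [] for node in nodes.node_id}

for y in years:
    data = xr.open_dataset(fname)   # fname: placeholder path, as in the question
    for n in range(len(nodes)):
        node = nodes.node_id.iloc[n]
        temp = data.sel(longitude=nodes.lon.iloc[n], latitude=nodes.lat.iloc[n])
        ndata[v][node].append(temp)

# concatenate along time exactly once per node, after the year loop
for node in ndata[v]:
    ndata[v][node] = xr.concat(ndata[v][node], dim='time')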

All together, this could look something like this:

# build nested filepaths
filepaths = [
    ['xxx.nc'.format(year=y, variable=v) for y in years]
    for v in variables
]

# build node indexer
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()

# I'm not sure if you have conflicting variable names - you'll need to
# tailor this line to your data setup. It may be that you want to just
# concatenate along years and then use `xr.merge` to combine the
# variables, or just handle one variable at a time
ds = xr.open_mfdataset(
    filepaths,
    combine='nested',
    concat_dim=['variable', 'year'],
    parallel=True,
)

# this will only schedule the operation - no work is done until the next line
ds_nodes = ds.sel(lat=node_indexer.lat, lon=node_indexer.lon)

# this triggers the operation using a dask LocalCluster, leveraging
# multiple threads on your machine (or a distributed Client if you have
# one set up)
ds_nodes.to_zarr('all_the_data.zarr')

# alternatively, you could still dump to pandas:
df = ds_nodes.to_dataframe()
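
If you still want one pickle file per node, as in your original code, here is a minimal sketch, assuming the combined DataFrame keeps node_id as an index level (this depends on your exact data layout):

# split the combined frame by node and write one pickle per node_id
for node, node_df in df.groupby(level='node_id'):
    node_df.droplevel('node_id').to_pickle('%s.xz' % node)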
