
Speeding up reading of very large netCDF file in Python

I have a very large netCDF file that I am reading using netCDF4 in Python.

I cannot read this file all at once since its dimensions (1200 x 720 x 1440) are too big for the entire file to be in memory at once. The 1st dimension represents time, and the next 2 represent latitude and longitude respectively.

import netCDF4

nc_file = netCDF4.Dataset(path_file, 'r', format='NETCDF4')
# Read one time step (year) at a time
for yr in years:
    nc_file.variables[variable_name][int(yr), :, :]

However, reading one year at a time is excruciatingly slow. How do I speed this up for the use cases below?

--EDIT

The chunk size is 1.

  1. I can read a range of years: nc_file.variables[variable_name][0:100, :, :] (see the sketch after the use cases below)

  2. There are several use cases:

# Sum each year
for yr in years:
    numpy.ma.sum(nc_file.variables[variable_name][int(yr), :, :])

# Multiply each year by a 2D array of shape (720 x 1440)
for yr in years:
    numpy.ma.sum(nc_file.variables[variable_name][int(yr), :, :] * arr_2d)

# Add 2 netcdf files together 
for yr in years:
    numpy.ma.sum(nc_file.variables[variable_name][int(yr), :, :] + 
                 nc_file2.variables[variable_name][int(yr), :, :])
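To be clear about what "reading a range of years" looks like combined with the first use case, here is a rough sketch (the block size of 100 is arbitrary):

# Hypothetical sketch: read many years in one call, then work on them in memory
block = nc_file.variables[variable_name][0:100, :, :]  # shape (100, 720, 1440)
for i in range(block.shape[0]):
    numpy.ma.sum(block[i, :, :])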

I highly recommend that you take a look at the xarray and dask projects. Using these powerful tools will allow you to easily split up the computation into chunks. This brings two advantages: you can compute on data which does not fit in memory, and you can use all of the cores in your machine for better performance. You can optimize the performance by appropriately choosing the chunk size (see the documentation).

You can load your data from netCDF by doing something as simple as

import xarray as xr
ds = xr.open_dataset(path_file)

If you want to chunk your data in years along the time dimension, then you specify the chunks parameter (assuming that the year coordinate is named 'year'):

ds = xr.open_dataset(path_file, chunks={'year': 10})

Since the other coordinates do not appear in the chunks dict, a single chunk will be used for them. (See more details in the documentation here.) This will be useful for your first requirement, where you want to multiply each year by a 2D array. You would simply do:

ds['new_var'] = ds['var_name'] * arr_2d

Now, xarray and dask are computing your result lazily. In order to trigger the actual computation, you can simply ask xarray to save your result back to netCDF:

ds.to_netcdf(new_file)
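If you just need the result in memory (for example, the yearly sums from the question) instead of a new file, you can trigger the computation explicitly; a small sketch, assuming the variable is named 'var_name' and the spatial dimensions are 'latitude' and 'longitude':

# Build the reduction lazily, then force dask to evaluate it
yearly_sums = ds['var_name'].sum(dim=['latitude', 'longitude'])
result = yearly_sums.compute()  # .load() also works on older xarray versions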

The computation gets triggered through dask, which takes care of splitting the processing into chunks and thus enables working with data that does not fit in memory. In addition, dask will take care of using all your processor cores for computing chunks.
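If you want explicit control over how many workers dask uses for that write, one option (a sketch, assuming reasonably recent xarray/dask versions) is to defer the write and run it yourself:

# Build the write lazily, then execute it with an explicit thread count
delayed_write = ds.to_netcdf(new_file, compute=False)
delayed_write.compute(scheduler='threads', num_workers=4)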

The xarray and dask projects still don't handle nicely situations where chunks do not "align" well for parallel computation. Since in this case we chunked only along the 'year' dimension, we expect to have no issues.
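If you do end up with awkwardly aligned chunks (for example after combining datasets), you can re-chunk before computing; a one-line sketch, keeping the assumed 'year' naming:

# Re-chunk so each block spans 10 years and the full spatial grid
ds = ds.chunk({'year': 10})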

If you want to add two different netCDF files together, it is as simple as:

ds1 = xr.open_dataset(path_file1, chunks={'year': 10})
ds2 = xr.open_dataset(path_file2, chunks={'year': 10})
(ds1 + ds2).to_netcdf(new_file)

I have provided a fully working example below, using a dataset available online.

In [1]:

import xarray as xr
import numpy as np

# Load sample data and strip out most of it:
ds = xr.open_dataset('ECMWF_ERA-40_subset.nc', chunks = {'time': 4})
ds.attrs = {}
ds = ds[['latitude', 'longitude', 'time', 'tcw']]
ds

Out[1]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
    tcw        (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...

In [2]:

arr2d = np.ones((73, 144)) * 3.
arr2d.shape

Out[2]:

(73, 144)

In [3]:

myds = ds
myds['new_var'] = ds['tcw'] * arr2d

In [4]:

myds

Out[4]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
    tcw        (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...
    new_var    (time, latitude, longitude) float64 30.46 30.46 30.46 30.46 ...

In [5]:

myds.to_netcdf('myds.nc')
xr.open_dataset('myds.nc')

Out[5]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
    tcw        (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...
    new_var    (time, latitude, longitude) float64 30.46 30.46 30.46 30.46 ...

In [6]:

(myds + myds).to_netcdf('myds2.nc')
xr.open_dataset('myds2.nc')

Out[6]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
Data variables:
    tcw        (time, latitude, longitude) float64 20.31 20.31 20.31 20.31 ...
    new_var    (time, latitude, longitude) float64 60.92 60.92 60.92 60.92 ...

Check the chunking of the file; ncdump -s <infile> will give the answer. If the chunk size in the time dimension is larger than one, you should read the same number of years at once, otherwise you are reading several years at once from disk and using only one at a time. How slow is slow? A few seconds per timestep at most sounds reasonable for an array of this size. Giving more info on what you do with the data later may give us more guidance on where the problem may be.
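If you would rather check the chunking from Python than with ncdump -s, the netCDF4 library exposes it per variable; a small sketch, reusing the names from the question:

import netCDF4

nc_file = netCDF4.Dataset(path_file, 'r')
var = nc_file.variables[variable_name]
print(var.chunking())  # 'contiguous', or a per-dimension list such as [1, 720, 1440]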

This is kinda hacky, but may be the simplest solution:

Read subsets of the file into memory, then cPickle (https://docs.python.org/3/library/pickle.html) them back to disk for future use. Loading your data from a pickled data structure is likely to be faster than parsing netCDF every time.
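A minimal sketch of that idea (the cache file name and the ten-year slab are arbitrary choices; on Python 3 the plain pickle module plays the role of cPickle):

import pickle

# Read a slab once from netCDF and cache it on disk...
slab = nc_file.variables[variable_name][0:10, :, :]
with open('slab_0_10.pkl', 'wb') as f:
    pickle.dump(slab, f)

# ...later runs reload the cached slab and skip netCDF parsing entirely
with open('slab_0_10.pkl', 'rb') as f:
    slab = pickle.load(f)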
