Xarray: most efficient way to select variables and calculate their mean

Question

I have a 3 GB datacube opened with xarray that has 3 variables I'm interested in (v, vx, vy). The description is below, along with the code.

I am only interested in one specific time window, from 2009 to 2013, while the entire dataset spans 1984 to 2018.

What I want to do is:

Grab the v, vx, vy values between 2009 and 2013
Calculate their mean along the time axis and save them as three 334x333 arrays

The issue is that it takes so long that after 1 hour, the few lines of code I wrote were still running. What I don't understand is that if I save my "v" values as an array, load them as such and calculate their mean, it takes far less time than doing what I wrote below (see code). I don't know if there is a memory leak, or if this is just a terrible way of doing it. My PC has 16 GB of RAM, of which 60% is available before loading the datacube, so theoretically it should have enough RAM to compute everything.

What would be an efficient way to truncate my datacube to the desired time window, then calculate the temporal mean (over axis 0) of the 3 variables "v", "vx", "vy"?

I tried doing it like this:

import numpy as np
import xarray as xr

datacube = xr.open_dataset('datacube.nc')  # Load the datacube
datacube = datacube.reindex(mid_date = sorted(datacube.mid_date.values))  # Sort the datacube by ascending time, where "mid_date" is the time dimension

sdate = '2009-01'   # Start date
edate = '2013-12'   # End date

ds = datacube.sel(mid_date = slice(sdate, edate))   # Create a new datacube gathering only the values between the start and end dates

vvtot = np.nanmean(ds.v.values, axis=0)   # Calculate the mean of the values of the "v" variable of the new datacube
vxtot = np.nanmean(ds.vx.values, axis=0)
vytot = np.nanmean(ds.vy.values, axis=0)






Dimensions:                    (mid_date: 18206, y: 334, x: 333)
Coordinates:
  * mid_date                   (mid_date) datetime64[ns] 1984-06-10T00:00:00....
  * x                          (x) float64 4.868e+05 4.871e+05 ... 5.665e+05
  * y                          (y) float64 6.696e+06 6.696e+06 ... 6.616e+06
Data variables: (12/43)
    UTM_Projection             object ...
    acquisition_img1           (mid_date) datetime64[ns] ...
    acquisition_img2           (mid_date) datetime64[ns] ...
    autoRIFT_software_version  (mid_date) float64 ...
    chip_size_height           (mid_date, y, x) float32 ...
    chip_size_width            (mid_date, y, x) float32 ...
                        ...
    vy                         (mid_date, y, x) float32 ...
    vy_error                   (mid_date) float32 ...
    vy_stable_shift            (mid_date) float64 ...
    vyp                        (mid_date, y, x) float64 ...
    vyp_error                  (mid_date) float64 ...
    vyp_stable_shift           (mid_date) float64 ...
Attributes:
    GDAL_AREA_OR_POINT:         Area
    datacube_software_version:  1.0
    date_created:               30-01-2021 20:49:16
    date_updated:               30-01-2021 20:49:16
    projection:                 32607

Answer 1

Try to avoid calling ".values" in between: when you do that, you switch from an xr.DataArray to an np.array, which loads the data into memory.
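As a rough illustration of that point, the sketch below compares the NumPy route from the question with the equivalent xarray reduction. The small random array is a hypothetical stand-in for the real "v" variable; only the dimension names are taken from the question.

```python
import numpy as np
import xarray as xr

# Small synthetic stand-in for the "v" variable (hypothetical sizes and values).
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 4, 3)).astype("float32")
data[0, 0, 0] = np.nan  # one missing value, to exercise the NaN handling
v = xr.DataArray(data, dims=("mid_date", "y", "x"), name="v")

# NumPy route: .values hands back a plain np.ndarray holding all the data.
mean_numpy = np.nanmean(v.values, axis=0)

# xarray route: the result stays a labelled DataArray; skipna=True ignores NaNs.
mean_xarray = v.mean(dim="mid_date", skipna=True)

print(type(v.values).__name__, type(mean_xarray).__name__)
print(mean_xarray.dims, mean_xarray.shape)
```

Both routes give the same numbers here; the difference is that the xarray reduction keeps the labels and, on a chunked dataset, can stay lazy until the result is requested.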

import xarray as xr
from dask.diagnostics import ProgressBar

# Open the dataset using chunks.
ds = xr.open_dataset(r"/path/to/your/data/test.nc", chunks = "auto")

# Select the period you want to have the mean for
# ("mid_date" is the time dimension in the question's datacube).
sdate = '2009-01'   # Start date
edate = '2013-12'   # End date
ds = ds.sel(mid_date = slice(sdate, edate))

# Calculate the mean for all the variables in your ds.
ds = ds.mean(dim = "mid_date")

# The above code takes less than a second, because no actual
# calculations have been done yet (and no data has been loaded into your RAM).
# Once you use ".values", ".compute()", or
# ".to_netcdf()" they will be done. We can see progress like this:
with ProgressBar():
    ds = ds.compute()
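Applied to the question's datacube, the same idea might look like the sketch below. It is a minimal example under stated assumptions, not a tested solution for the real file: the dataset is a small synthetic stand-in with shuffled dates, dask is assumed to be installed, and restricting the dataset to "v", "vx" and "vy" before reducing is an addition here so that the other variables (43 in total, per the description) are not averaged as well.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the datacube: unsorted dates, three float32 variables.
# With the real file this would be: xr.open_dataset("datacube.nc", chunks="auto")
dates = pd.date_range("2008-01-01", "2014-12-31", freq="30D")
rng = np.random.default_rng(0)
shuffled = rng.permutation(dates.values)
shape = (shuffled.size, 4, 3)
datacube = xr.Dataset(
    {name: (("mid_date", "y", "x"), rng.normal(size=shape).astype("float32"))
     for name in ("v", "vx", "vy")},
    coords={"mid_date": shuffled},
).chunk({"mid_date": 20})  # dask-backed, similar to opening with chunks="auto"

# Keep only the variables of interest, sort by time, then slice the window.
subset = datacube[["v", "vx", "vy"]].sortby("mid_date")
subset = subset.sel(mid_date=slice("2009-01", "2013-12"))

# Lazy mean over time; nothing is computed until .compute() is called.
means = subset.mean(dim="mid_date", skipna=True).compute()

# Plain 2-D NumPy arrays, one per variable, as asked for in the question.
vvtot, vxtot, vytot = (means[name].values for name in ("v", "vx", "vy"))
print(vvtot.shape)
```

Here ".values" is only called at the very end, on the already reduced 2-D result, so the full time series never has to be held in memory as a single NumPy array.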

Answer 1
0 2022-01-21 16:20:03
