How to achieve numpy indexing with an xarray dataset

Question

I know the x and y indices into a 2D array (numpy indexing).

According to this documentation, xarray uses, e.g., Fortran-style indexing.

So when I pass, e.g.,

ind_x = [1, 2]
ind_y = [3, 4]

I expect 2 values, one for each of the index pairs (1,3) and (2,4), but xarray returns a 2x2 matrix.
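The difference described here can be reproduced on a small synthetic array. This is only a sketch: the array contents and the dimension names "x" and "y" are assumptions made for illustration, not taken from the original data.

```python
import numpy as np
import xarray as xr

# Synthetic 5x6 array standing in for the real 2D data (an assumption).
a = np.arange(30).reshape(5, 6)
da = xr.DataArray(a, dims=("x", "y"))

ind_x = [1, 2]
ind_y = [3, 4]

# numpy pairs the two index lists element-wise: (1, 3) and (2, 4).
pointwise = a[ind_x, ind_y]

# xarray combines plain lists as an outer product: every x with every y.
outer = da.isel(x=ind_x, y=ind_y)

print(pointwise.shape)  # (2,)
print(outer.shape)      # (2, 2)
```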

Now I want to know how to achieve numpy-like indexing with xarray.

Note: I want to avoid loading the whole dataset into memory, so using the .values API is not part of the solution I am looking for.

Answer 1

You can access the underlying numpy array to index it directly:

import xarray as xr

x = xr.tutorial.load_dataset("air_temperature")

ind_x = [1, 2]
ind_y = [3, 4]

print(x.air.data[0, ind_y, ind_x].shape)
# (2,)

Edit:

Assuming your data is in a dask-backed xarray object and you don't want to load all of it into memory, you need to use vindex on the dask array behind the xarray data object:

import xarray as xr

# chunking converts the variables to dask arrays
x = xr.tutorial.load_dataset("air_temperature").chunk({"time": 1})

ind_x = [1, 2]
ind_y = [3, 4]

# vindex applies pointwise (numpy-style) indexing lazily
extract = x.air.data.vindex[0, ind_y, ind_x]

print(extract.shape)
# (2,)

print(extract.compute())
# array([267.1, 274.1], dtype=float32)
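As a possible alternative that stays within xarray, the indexers can be passed as DataArray objects that share a dimension name, which xarray treats as pointwise (vectorized) indexing. The sketch below uses a synthetic array in place of the tutorial dataset, and the dimension name "points" is an arbitrary label chosen for illustration. xarray's documentation describes vectorized indexing as supported on dask-backed arrays too, so the same call should stay lazy there, though that is not verified here.

```python
import numpy as np
import xarray as xr

# Synthetic (time, lat, lon) array standing in for x.air (an assumption).
air = xr.DataArray(
    np.arange(2 * 5 * 6, dtype="float32").reshape(2, 5, 6),
    dims=("time", "lat", "lon"),
)

ind_x = [1, 2]
ind_y = [3, 4]

# Indexers sharing the dimension "points" are paired element-wise
# instead of being combined as an outer product.
lon_idx = xr.DataArray(ind_x, dims="points")
lat_idx = xr.DataArray(ind_y, dims="points")

extract = air.isel(time=0, lat=lat_idx, lon=lon_idx)

print(extract.shape)  # (2,)
```

The result has a single dimension, "points", with one value per index pair.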

Answer 2

To take speed into account, I ran a test comparing different methods.

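from pathlib import Path
from typing import List

import numpy as np
import xarray
from netCDF4 import Dataset


def method_1(file_paths: List[Path], indices) -> List[np.ndarray]:
    # netCDF4: index the on-disk variable directly
    data = []
    for file in file_paths:
        d = Dataset(file, 'r')
        data.append(d.variables['hrv'][indices])
        d.close()
    return data


def method_2(file_paths: List[Path], indices) -> List[np.ndarray]:
    # xarray: .values loads the whole variable before indexing
    data = []
    for file in file_paths:
        data.append(xarray.open_dataset(file, engine='h5netcdf').hrv.values[indices])
    return data


def method_3(file_paths: List[Path], indices) -> List[np.ndarray]:
    # xarray + dask: lazy pointwise indexing, computed once per file
    data = []
    for file in file_paths:
        data.append(xarray.open_mfdataset([file], engine='h5netcdf').hrv.data.vindex[indices].compute())
    return data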

In [1]: len(file_paths)
Out[1]: 4813

The results:

method_1 (using netcdf4 library): 101.9s
method_2 (using xarray and values API): 591.4s
method_3 (using xarray+dask): 688.7s

I guess that xarray+dask spends too much time in the .compute step.
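The answer does not show how the timings were taken. One way such numbers could be collected is with a small wall-clock helper like the sketch below; time_call is a hypothetical name introduced here, and the stand-in workload replaces the real methods so the snippet runs without any data files.

```python
import time
from typing import Any, Callable, Tuple


def time_call(func: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run func(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Stand-in workload; for the benchmark above the call would look like
# time_call(method_1, file_paths, indices).
result, seconds = time_call(sum, range(1000))
print(f"{seconds:.4f}s")
```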

Answer 1
1 2021-04-22 07:19:19

Answer 2
0 2021-04-22 11:00:48
