
xarray: assigning individual values to one variable/DataArray ends up assigning to all variables/DataArrays

I have a script where I create a big xarray dataset full of np.nan and then assign individual values in a loop with .loc (I also tried positional indexing), following the xarray indexing documentation.

I get something quite weird.

Here is my minimal reproducible example:

import xarray as xr
import numpy as np

levels = np.arange(0,3)
simNames = ['9airports_filter0dot7_v22']
airportList = ['Windhoek', 'Atlanta', 'Taipei']

emptyDA = xr.DataArray(np.nan, coords = [simNames, airportList, np.arange(0, 20428), levels], 
                       dims = ['simName', 'airport', 'profnum', 'level'])

ds = xr.Dataset({
    'iasi': emptyDA,
    'IM':   emptyDA,
    'IMS': emptyDA,
    'err': emptyDA,
    'sigma': emptyDA,
    'temp': emptyDA, 
    'dfs': emptyDA, 
    'ocf': emptyDA, 
    'rcf': emptyDA, 
    'time': emptyDA, 
    'surfPres': emptyDA })

ds = ds.assign_coords(time = ds.time) # pass time from variable to coord

ds['dfs'].loc['9airports_filter0dot7_v22', 'Windhoek', 0, 0] = 3

My scalar "3" ends up assigned to every DataArray, not just 'dfs':

<xarray.Dataset>
Dimensions:   (simName: 1, airport: 3, profnum: 20428, level: 3)
Coordinates:
  * simName   (simName) <U25 '9airports_filter0dot7_v22'
  * airport   (airport) <U8 'Windhoek' 'Atlanta' 'Taipei'
  * profnum   (profnum) int64 0 1 2 3 4 5 ... 20423 20424 20425 20426 20427
  * level     (level) int64 0 1 2
Data variables:
    iasi      (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    IM        (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    IMS       (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    err       (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    sigma     (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    temp      (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    dfs       (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    ocf       (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    rcf       (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    time      (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan
    surfPres  (simName, airport, profnum, level) float64 3.0 nan nan ... nan nan 

However, this simpler code works correctly:

import xarray as xr
import numpy as np

ds = xr.Dataset({'var1': (('x', 'y'), [[np.nan, np.nan],[np.nan, np.nan]]), 'var2': (('x', 'y'), [[np.nan, np.nan], [np.nan, np.nan]])})

ds['var1'].loc[0, 0] = 1

Okay, I have understood my mistake: emptyDA is not copied for each new variable, so all the variables point to the same object. Passing emptyDA.copy() instead of emptyDA resolves the problem. I thought that creating the xarray object would have copied the data. Thanks for your help.
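For reference, the fix described in this comment could look roughly like the sketch below. It is an illustration rather than the asker's actual corrected script: the profnum axis is shortened to 5 entries, the 'time' variable and the assign_coords step are left out, and the names empty_da, var_names and filled are introduced here only for the example.

```python
import numpy as np
import xarray as xr

# Scaled-down stand-in for the question's coordinates (profnum shortened).
sim_names = ["9airports_filter0dot7_v22"]
airports = ["Windhoek", "Atlanta", "Taipei"]
profnums = np.arange(0, 5)
levels = np.arange(0, 3)

empty_da = xr.DataArray(
    np.nan,
    coords=[sim_names, airports, profnums, levels],
    dims=["simName", "airport", "profnum", "level"],
)

# Illustrative subset of the question's variable names.
var_names = ["iasi", "IM", "IMS", "err", "sigma", "temp", "dfs", "ocf", "rcf", "surfPres"]

# Give every variable its own deep copy so no two variables share a buffer.
ds = xr.Dataset({name: empty_da.copy() for name in var_names})

ds["dfs"].loc["9airports_filter0dot7_v22", "Windhoek", 0, 0] = 3

# Count the non-NaN entries per variable; only "dfs" should have one.
filled = {name: int(ds[name].notnull().sum()) for name in var_names}
print(filled)
```

The dict comprehension keeps the one-copy-per-variable rule in a single place, so a variable added to the list later cannot silently reuse the shared array.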

The issue occurs because when you initialize an xarray.Dataset with a dictionary of DataArrays, it makes a shallow copy of each DataArray. Each copy can carry its own metadata, but the underlying numpy arrays are not duplicated.

You can see this behavior with a small example based on your question.

First I'll create a new numpy array with all NaNs:

In [1]: import xarray as xr, numpy as np, pandas as pd

In [2]: np_arr = np.array([np.nan, np.nan, np.nan, np.nan])

In [3]: np_arr
Out[3]: array([nan, nan, nan, nan])

We can see the object's ID (in CPython, its memory address) here:

In [4]: hex(id(np_arr))
Out[4]: '0x1186570f0'

Remember this address, we'll come back to it: '0x1186570f0'

Next we'll create a DataArray wrapping this numpy array:

In [5]: da = xr.DataArray(np_arr, dims=['x'], coords=[range(4)])

In [6]: da
Out[6]:
<xarray.DataArray (x: 4)>
array([nan, nan, nan, nan])
Coordinates:
  * x        (x) int64 0 1 2 3

The DataArray itself gets a new ID, but its underlying data is still the same numpy object at '0x1186570f0':

In [7]: hex(id(da))
Out[7]: '0x118668460'

In [8]: hex(id(da.data))
Out[8]: '0x1186570f0'

When you initialize a Dataset with a dictionary of DataArrays, xarray makes a shallow copy of the arrays. Note that the DataArray objects in the Dataset have new IDs:

In [9]: ds = xr.Dataset({'var1': da, 'var2': da})

In [10]: hex(id(ds['var1']))
Out[10]: '0x1186d5340'

In [11]: hex(id(ds['var2']))
Out[11]: '0x1186e0fa0'

This allows each array to have different attributes/metadata:

In [12]: ds['var1'].name
Out[12]: 'var1'

In [13]: ds['var2'].name
Out[13]: 'var2'

However, the data of both variables still points to the original numpy address:

In [14]: hex(id(ds['var1'].data))
Out[14]: '0x1186570f0'

In [15]: hex(id(ds['var2'].data))
Out[15]: '0x1186570f0'
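Comparing IDs is one way to spot the sharing. Another option, sketched below, is to write through one variable and then ask NumPy whether the two buffers overlap. np.shares_memory is a standard NumPy function that the session above does not use, and the name shares is introduced here only for illustration.

```python
import numpy as np
import xarray as xr

np_arr = np.full(4, np.nan)
da = xr.DataArray(np_arr, dims=["x"], coords=[range(4)])

# Same DataArray passed twice, as in the question.
ds = xr.Dataset({"var1": da, "var2": da})

# Write a single element through var1 only.
ds["var1"].loc[{"x": 0}] = 1.0

# Ask NumPy directly whether the two variables overlap in memory.
shares = np.shares_memory(ds["var1"].data, ds["var2"].data)

print("var1:", ds["var1"].data)
print("var2:", ds["var2"].data)
print("shares memory:", shares)
```

If the two variables share a buffer, as the identical IDs above indicate, var2 shows the same value at x=0 even though only var1 was assigned to.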

This is a good thing, because it means working with xarray will not blow up your memory usage unless you tell it to. But you do have to tell it to copy the data if that is what you want.

You can do this with a deep copy, which xarray.DataArray.copy performs by default:

In [16]: ds = xr.Dataset({'var1': da.copy(), 'var2': da.copy()})

In [17]: hex(id(ds['var1'].data))
Out[17]: '0x1186b23f0'

In [18]: hex(id(ds['var2'].data))
Out[18]: '0x118660090'
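As a quick sanity check, the sketch below repeats the deep-copy construction and confirms that an assignment to one variable leaves the other untouched. The names independent and var2_untouched are made up for this example, and np.shares_memory is a NumPy helper rather than part of the session above.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.full(4, np.nan), dims=["x"], coords=[range(4)])

# Each variable receives its own deep copy of the data.
ds = xr.Dataset({"var1": da.copy(), "var2": da.copy()})

# Assign through var1 only.
ds["var1"].loc[{"x": 0}] = 1.0

# The two buffers should no longer overlap, and var2 should still be all NaN.
independent = not np.shares_memory(ds["var1"].data, ds["var2"].data)
var2_untouched = bool(ds["var2"].isnull().all())

print("independent buffers:", independent)
print("var2 still all NaN:", var2_untouched)
```

The trade-off is memory: every deep copy allocates a full array, so for a dataset the size of the one in the question this multiplies the footprint by the number of variables.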
