
How to join data from multiple netCDF files with xarray in Python?

I'm trying to open multiple netCDF files with xarray in Python. The files hold data with the same shape, and I want to join them along a new dimension.

I tried the concat_dim argument of xarray.open_mfdataset(), but it doesn't work as I expected. The example below opens two files with temperature data for 124 times, 241 latitudes and 480 longitudes:

import xarray as xr

DS = xr.open_mfdataset( 'eraINTERIM_t2m_*.nc', concat_dim='cases' )
da_t2m = DS.t2m

print( da_t2m )

With this code, I expected the resulting data array to have a shape like (cases: 2, time: 124, latitude: 241, longitude: 480). However, its shape was (cases: 2, time: 248, latitude: 241, longitude: 480). It creates the new dimension, but it also adds up the sizes of the leftmost dimension, 'time', of the two datasets. I wonder whether this is a bug in 'xarray.open_mfdataset' or expected behavior because the 'time' dimension is UNLIMITED in both datasets.

Is there a way to join the data from these files directly with xarray and get the result I expected above?

Thank you.

Mateus

Extending my comment, I would try this:

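def preproc(ds):
    # Keep the original time stamps in a new variable 'stime', then drop the
    # 'time' coordinate and rename the dimension to 'ntime', so that files
    # covering different periods are not aligned on their time values.
    ds = ds.assign({'stime': (['time'], ds.time.values)}).drop_vars('time').rename({'time': 'ntime'})
    # we might need to tweak this a bit further, depending on the actual data layout
    return ds

# Recent xarray releases require combine='nested' when concat_dim is given.
DS = xr.open_mfdataset( 'eraINTERIM_t2m_*.nc', concat_dim='cases', combine='nested', preprocess=preproc)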

The good thing here is that you keep the original time coordinate in stime while renaming the original dimension (time -> ntime).

If everything works well, the resulting dimensions should be (cases, ntime, latitude, longitude).

Disclaimer: I do something similar in a loop with a final concat (which works very well), but I did not test the preprocess approach.
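The loop-plus-concat variant mentioned in the disclaimer could look roughly like the sketch below. It does not read the actual ERA-Interim files: make_case and to_relative_time are hypothetical helpers, and the tiny 3 x 2 x 2 datasets are made-up stand-ins for two files covering different periods.

```python
import numpy as np
import pandas as pd
import xarray as xr

def make_case(start):
    """Hypothetical stand-in for one file: 3 daily time steps on a 2x2 grid."""
    time = pd.date_range(start, periods=3, freq="D")
    data = np.arange(12, dtype=float).reshape(3, 2, 2)
    return xr.Dataset(
        {"t2m": (("time", "latitude", "longitude"), data)},
        coords={"time": time, "latitude": [0.0, 1.0], "longitude": [10.0, 11.0]},
    )

def to_relative_time(ds):
    """Store the real time stamps in 'stime' and make 'time' a plain index 'ntime'."""
    stamps = ds["time"].values
    ds = ds.drop_vars("time").rename_dims({"time": "ntime"})
    return ds.assign(stime=("ntime", stamps))

# Loop over the cases, then concatenate once along the new 'cases' dimension.
cases = [to_relative_time(make_case(start)) for start in ("2018-12-01", "2019-01-01")]
joined = xr.concat(cases, dim="cases")

print(joined["t2m"].sizes)   # cases, ntime, latitude, longitude
print(joined["stime"].shape) # one row of time stamps per case
```

Because 'ntime' carries no coordinate values, xarray has nothing to align on along that dimension, so the two cases are stacked position by position while each keeps its own time stamps in stime.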

The result makes sense if the times are different.

To simplify, forget about the lat-lon dimensions for a moment and imagine you have two files that each simply hold data at 2 timeslices. The first has data at timesteps 1 and 2, and the second at timesteps 3 and 4. You can't create a combined dataset with a time dimension that only spans 2 timeslices; the time dimension variable has to contain the times 1, 2, 3, 4. So if you say you want a new dimension "cases", the data is combined as a 2D array and would look like this:

times: 1,2,3,4

cases: 1,2

data: 
               time
          1    2    3    4
cases 1:  x1   x2 
      2:            x3   x4

Think of the equivalent netCDF file: its time dimension has to span the range of values present in both files. The only way you could combine two files and get (cases: 2, time: 124, latitude: 241, longitude: 480) would be if both files had the same time, lat AND lon values, i.e. pointed to exactly the same region in time-lat-lon space.
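The two-timeslice picture above can be reproduced with a few lines of xarray. This is only a sketch on made-up values (10, 20, 30, 40 stand in for x1 to x4), with join="outer" spelled out as the alignment rule assumed here:

```python
import numpy as np
import xarray as xr

# Two hypothetical "files": same shape, but different time values.
first = xr.DataArray([10.0, 20.0], dims="time", coords={"time": [1, 2]}, name="t2m")
second = xr.DataArray([30.0, 40.0], dims="time", coords={"time": [3, 4]}, name="t2m")

# Concatenating along a new dimension aligns the time coordinates first;
# with an outer join, time becomes the union 1, 2, 3, 4 and gaps are NaN.
joined = xr.concat([first, second], dim="cases", join="outer")
print(joined.sizes)
print(joined.values)

# Only when both inputs share identical time values does time keep its length.
second_same_time = second.assign_coords(time=first["time"].values)
stacked = xr.concat([first, second_same_time], dim="cases", join="outer")
print(stacked.sizes)
```

The first result has 4 time steps with NaN wherever a case has no data, which is the 248-step behavior seen in the question; the second keeps 2 time steps because the time values coincide.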

PS: Somewhat off-topic for the question, but if you are just starting a new analysis, why not switch to the new-generation, higher-resolution ERA5 reanalysis? It is now available back to 1979 too (and will eventually be extended further back), and you can download it straight to your desktop with the Python API scripts from here:

https://cds.climate.copernicus.eu/cdsapp#!/search?type=dataset

Thank you @AdrianTompkins and @jhamman. After your comments I realize that, because of the different time periods, I really can't get what I want with xarray.

My main purpose in creating such an array is to gather all the data for different events of the same duration in a single N-D array. That way I can easily get, for example, composite fields of all events for each time (hour, day, etc.).

I'm trying to do the same as I do with NCL. Below is NCL code that works as expected (for me) on the same data:

f = addfiles( (/"eraINTERIM_t2m_201812.nc", "eraINTERIM_t2m_201901.nc"/), "r" )
ListSetType( f, "join" )
temp = f[:]->t2m
printVarSummary( temp )

The final result is an array with 4 dimensions, with the new one automatically named ncl_join.

However, NCL doesn't respect the time axis: it joins the arrays and gives the resulting time axis the coordinates of the first file. So the time axis becomes useless.

As @AdrianTompkins rightly said, though, the time periods are different and xarray can't join data like this. So, to create such an array in Python with xarray, I think the only way is to delete the time coordinate from the arrays. The time dimension would then have only integer indexes.

The array given by xarray behaves as @AdrianTompkins described in his small example. Since it keeps the time coordinates for all merged data, I think the xarray solution is the correct one, compared with NCL. But now I think that computing composites (as in the example given above) wouldn't be as easy as it seems with NCL.
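For what it's worth, once the events share a plain integer time index, a composite may reduce to a mean over the new dimension. The sketch below assumes a made-up array with 2 events and 3 relative time steps; the dimension name 'ntime' and the values are illustrative, not taken from the real data.

```python
import numpy as np
import xarray as xr

# Hypothetical joined array: 2 events ('cases'), 3 relative time steps each,
# with 'ntime' as a plain integer index (no time stamps attached).
t2m = xr.DataArray(
    np.array([[250.0, 252.0, 254.0],
              [260.0, 262.0, 264.0]]),
    dims=("cases", "ntime"),
    name="t2m",
)

# Composite: average over all events at each relative time step.
composite = t2m.mean(dim="cases")
print(composite.values)  # [255. 257. 259.]
```

With latitude and longitude dimensions present, the same call would leave them untouched and return one composite field per relative time step.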

In a small test, I printed two values from the array merged by xarray with

print( da_t2m[ 0, 0, 0, 0 ].values )
print( da_t2m[ 1, 0, 0, 0 ].values )

which results in

252.11412
nan

For the second case there is no data at the first time, as expected.

UPDATE: all the answers helped me understand this problem better, so I had to add an update here to also thank @kmuehlbauer for his answer, since his code gives the expected array.

Again, thank you all for your help!

Mateus

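Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.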

 