
Reshaping and combining data from netCDF in Python

I'm currently reading a netCDF file with xarray in Python. It contains 3-hourly temperature (t2m) data. The data has dimensions (time: 2920, latitude: 189, longitude: 521), i.e. shape (2920, 189, 521), which represents one year of data. I have 30 of these files, 2 GB each.

longitude (longitude) float32         -170.0 -169.8 ... -40.25 -40.0
latitude  (latitude)  float32         82.0 81.75 81.5 ... 35.5 35.25 35.0
time      (time)      datetime64[ns]  1979-01-01T01:00:00 ... 1979-12-...

I would like to reshape this data into a format that I can feed into scikit-learn's

sklearn.model_selection.train_test_split

i.e. I would like to generate the following DataFrame for each file/year:

index   time                  lat   lon       t2m
0       1979-01-01T00:00:00   35    -170      270
1       1979-01-01T00:00:00   35    -169.75   269
2       1979-01-01T00:00:00   35    -169.5    271
...
n-1     1979-12-31T21:00:00   82    -40.25    241
n       1979-12-31T21:00:00   82    -40       244

Note that there would be 521 rows with lat=35 before moving on to the next latitude value. After going through all 189 latitude values, we move on to the next timestep and repeat until finished.
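The row order described above (longitude varying fastest, then latitude, then time) can be illustrated with a small pandas-only sketch. The toy coordinate values below are assumptions standing in for the real grid, on which the same layout would have 2920 × 189 × 521 = 287,529,480 rows per file.

```python
import numpy as np
import pandas as pd

# Toy coordinates standing in for the real 2920 x 189 x 521 grid (assumed values).
times = pd.to_datetime(["1979-01-01T00:00", "1979-01-01T03:00"])
lats = np.array([35.0, 35.25, 35.5])
lons = np.array([-170.0, -169.75])

# from_product varies the last level fastest: lon, then lat, then time.
index = pd.MultiIndex.from_product([times, lats, lons], names=["time", "lat", "lon"])

# Turn the three index levels into ordinary columns with a fresh 0..n-1 index.
layout = index.to_frame(index=False)

print(layout)
```

Every longitude is listed for a given latitude before the latitude advances, and every latitude is listed before the time advances.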

I assume there is a way to achieve this with some combination of melting and reshaping the xarray ds, but I've yet to find anything that works. Any advice would be appreciated.

This should be achievable with xarray's built-in methods, as shown below. There are possibly more commands here than you need. One thing to be careful about when converting xarray datasets to dataframes: if the coordinates have "bounds" variables, values can be duplicated, but the code below should deal with that.

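# ds is the xarray Dataset opened from one yearly file
df = (ds
      # convert to dataframe
      .to_dataframe()
      # convert time and latitude/longitude to columns
      .reset_index()
      # the file names its coordinates latitude/longitude; shorten them
      .rename(columns={"latitude": "lat", "longitude": "lon"})
      # only select what you want, in case there are bnds etc. in the data
      .loc[:, ["time", "lat", "lon", "t2m"]]
      # remove duplicates that could be introduced by bnds
      .drop_duplicates()
      # replace the old row labels with a fresh 0..n-1 index
      .reset_index(drop=True)
      )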
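To get the exact row order requested in the question and to combine several years, one possible sketch is shown below. It is not part of the original answer. `make_year` and `to_long_frame` are hypothetical helpers, and the tiny synthetic datasets stand in for the real files. Latitude is stored north to south in the file, so the sketch sorts it ascending first, then uses the `dim_order` argument of `to_dataframe` so that longitude varies fastest.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_year(year):
    """Build a tiny synthetic stand-in for one yearly file (hypothetical helper)."""
    time = np.array([f"{year}-01-01T00:00", f"{year}-01-01T03:00"], dtype="datetime64[ns]")
    latitude = np.array([35.5, 35.25, 35.0], dtype="float32")  # north to south, as in the file
    longitude = np.array([-170.0, -169.75], dtype="float32")
    t2m = np.arange(12, dtype="float32").reshape(2, 3, 2)
    return xr.Dataset(
        {"t2m": (("time", "latitude", "longitude"), t2m)},
        coords={"time": time, "latitude": latitude, "longitude": longitude},
    )


def to_long_frame(ds):
    """Flatten one dataset to rows ordered by time, then lat (ascending), then lon."""
    return (
        ds.sortby("latitude")  # ascending latitude, so the southernmost row comes first
        .to_dataframe(dim_order=["time", "latitude", "longitude"])  # longitude varies fastest
        .reset_index()  # turn the MultiIndex levels into columns
        .rename(columns={"latitude": "lat", "longitude": "lon"})
        .loc[:, ["time", "lat", "lon", "t2m"]]
    )


# One long table per year, stacked with a fresh 0..n-1 index.
frames = [to_long_frame(make_year(year)) for year in (1979, 1980)]
combined = pd.concat(frames, ignore_index=True)

print(combined.head())
```

For the real data, `make_year(year)` would be replaced by `xr.open_dataset(path)` for each file, where `path` is whatever your files are called. Each year expands to roughly 287.5 million rows, so stacking all 30 years into one DataFrame may not fit in memory. Converting and processing one year at a time, or subsetting before the conversion, may be more practical.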


 