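How to convert multiple 2D arrays to 1D columns using xarray and dask in Python?
I have stacked seven 2D cloud-optimized GeoTIFFs into a single DataArray in xarray. They are very large, so I am using the intake-xarray extension and dask to stream the data from S3 without loading it all into RAM. I concatenated them along their "band" dimension to stack them.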
import intake
import xarray as xr

catalog = intake.open_catalog("s3://example-data/datasets.yml")
datasets = ['dem',
            'dem_slope',
            'dem_slope_aspect',
            'distance_railways',
            'distance_river',
            'distance_road',
            'worldcover']
# open each raster lazily as a dask-backed DataArray
to_concat = []
for data in datasets:
    x = catalog[data].to_dask()
    to_concat.append(x)
merged = xr.concat(to_concat, dim='band')
merged.coords['band'] = datasets  # add labels to band dimension
y_array = catalog["NASA_global"].to_dask()
y_array.coords['band'] = 'NASA_global'
merged
<xarray.DataArray (band: 7, y: 225000, x: 450000)>
dask.array<concatenate, shape=(7, 225000, 450000), dtype=float32, chunksize=(1, 256, 256), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) <U31 'dem' ... 'worldcover'
  * y        (y) float64 90.0 90.0 90.0 90.0 90.0 ... -90.0 -90.0 -90.0 -90.0
  * x        (x) float64 -180.0 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0 180.0
Attributes:
    transform:      (0.0008, 0.0, -180.0, 0.0, -0.0008, 90.0)
    crs:            +init=epsg:4326
    res:            (0.0008, 0.0008)
    is_tiled:       1
    nodatavals:     (32767.0,)
    scales:         (1.0,)
    offsets:        (0.0,)
    AREA_OR_POINT:  Area
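My question is: how do I now convert the data into several 1D columns, the equivalent of flattening a 2D array in NumPy? I looked at .squeeze() to drop dimensions, but could not get the data into the required format. I want to do some machine learning and need the data in a suitable format. I am new to dask and xarray.
I would really appreciate any help or advice.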
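To make the target layout concrete, here is a minimal sketch of the flattening on a tiny synthetic array. The 2 x 4 grid, the three band names, and the in-memory NumPy data are assumptions for illustration only; the real rasters are far larger and dask-backed.

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in for the merged array: 3 bands on a 2 x 4 grid.
bands = ["dem", "dem_slope", "worldcover"]
data = np.arange(3 * 2 * 4, dtype="float32").reshape(3, 2, 4)
merged = xr.DataArray(
    data,
    dims=("band", "y", "x"),
    coords={
        "band": bands,
        "y": [50.0, 49.0],
        "x": [-124.0, -123.0, -122.0, -121.0],
    },
)

# Collapse the two spatial dimensions into a single "z" dimension,
# then put pixels on rows and bands on columns.
flat = merged.stack(z=("y", "x")).transpose("z", "band")
print(flat.shape)  # (8, 3): 8 pixels, 3 band columns
```

Each column of `flat` is one band flattened in row-major order, the same ordering `numpy.ravel()` would give for that band.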
For anyone interested, I figured out how to do this in xarray, but it blew up the memory of my instance.
import intake
import xarray as xr

# load the intake catalog
catalog = intake.open_catalog("s3://example-data/datasets.yml")
datasets = ['dem',
            'dem_slope',
            'dem_slope_aspect',
            'distance_railways',
            'distance_river',
            'distance_road',
            'worldcover']
to_concat = []
for data in datasets:
    x = catalog[data].to_dask()
    to_concat.append(x)
# define X and y
merged = xr.concat(to_concat, dim='band').sel(x=slice(-124, -66), y=slice(50, 24))
merged.coords['band'] = datasets
X_array = merged.values  # materializes the whole selection in memory as a NumPy array
y_array = catalog["NASA_global"].to_dask()
y_array.coords['band'] = 'NASA_global'
# reshape
# note: .stack() is a DataArray method; a NumPy array returned by .values does not have it
X_temp = X_array.stack(z=('x','y'))
X = X_temp.transpose('z', 'band')
Calling X_array = merged.values loads everything into a NumPy array and kills the instance. A colleague came up with a better solution that does not eat up memory:
import dask.dataframe as dd
import intake
import xarray as xr

catalog = intake.open_catalog("s3://example-data/datasets.yml")
datasets = ['dem',
            'dem_slope',
            'dem_slope_aspect',
            'distance_railways',
            'distance_river',
            'distance_road',
            'worldcover']
to_concat = []
for data in datasets:
    x = catalog[data].to_dask()
    to_concat.append(x)
# define X and y
X_array = xr.concat(to_concat, dim='band').sel(x=slice(-124, -66), y=slice(50, 24))
X_array.coords['band'] = datasets
y_array = catalog["NASA_global"].to_dask()
# reshape the underlying dask arrays lazily: one row per band, one column per pixel
X_table = X_array.data.reshape((7, -1), merge_chunks=True)
y_table = y_array.data.reshape((1, -1), merge_chunks=True)
# transpose so that pixels are rows and bands are columns of a dask DataFrame
X = dd.from_dask_array(X_table.T, columns=datasets)
y = dd.from_dask_array(y_table.T, columns=['NASA_global'])