Python Xarray: how to convert a 3-d DataArray to a 2-d stacked Pandas dataframe
I have a 3d xarray DataArray of time series data for multiple runs of a model. The rows are indexed by the simulation timestep, the columns are a variety of variables captured about the model, and the depth coordinate represents the individual simulation run, since I run the entire simulation multiple times.
My goal is to take this 3d xarray DataArray and convert it to a 2d pandas dataframe so that I can export it to a CSV file. In order to do that, I need to stack each of the simulation runs on top of each other, so that the 3d array is converted to a 2d array.
I have some code to generate some test data, but I am not familiar enough with Xarray to know how to do this kind of stacking.
So here is some code to develop test data.
import xarray as xr
import pandas as pd
import numpy as np
from tqdm import tqdm

# 5 runs x 7 years x 4 recorded variables
results_matrix = np.zeros([5, 7, 4])
simulation_matrix = xr.DataArray(results_matrix,
                                 coords={'simdata': ['val1', 'val2', 'val3', 'val4'],
                                         'run': range(5),
                                         'year': range(7)},
                                 dims=('run', 'year', 'simdata'))
itercount = 0
for i in tqdm(range(5)):
    # Fill every cell of run i with the run number
    simulation_matrix[i, :, :] = i
    itercount += 1
This code will generate a DataArray that looks like
<xarray.DataArray (run: 5, year: 7, simdata: 4)>
array([[[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]],
[[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]],
... Additional arrays truncated
I want this converted to a 2d Pandas dataframe, something like
[[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]]
Any suggestions?
UPDATED:
Based upon comments from @rahlf23 and @DSM, I had some luck with simulation_matrix.to_dataframe('fred').unstack():
fred
simdata val1 val2 val3 val4
run year
0 0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0
1 0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 1.0 1.0 1.0 1.0
Using your test data, you can use to_pandas() and pd.concat():
df = pd.concat([simulation_matrix.loc[i,:,:].to_pandas() for i in range(simulation_matrix.sizes['run'])])
Yields:
simdata val1 val2 val3 val4
year
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 1.0 1.0 1.0 1.0
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0
4 2.0 2.0 2.0 2.0
5 2.0 2.0 2.0 2.0
6 2.0 2.0 2.0 2.0
0 3.0 3.0 3.0 3.0
1 3.0 3.0 3.0 3.0
2 3.0 3.0 3.0 3.0
3 3.0 3.0 3.0 3.0
4 3.0 3.0 3.0 3.0
5 3.0 3.0 3.0 3.0
6 3.0 3.0 3.0 3.0
0 4.0 4.0 4.0 4.0
1 4.0 4.0 4.0 4.0
2 4.0 4.0 4.0 4.0
3 4.0 4.0 4.0 4.0
4 4.0 4.0 4.0 4.0
5 4.0 4.0 4.0 4.0
6 4.0 4.0 4.0 4.0
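One drawback of the concatenated frame above is that the year index simply repeats, so nothing in an exported CSV would say which run a row came from. Here is a minimal sketch of one way around that, assuming the same test data as in the question (rebuilt here without the loop so the snippet runs on its own). Passing keys to pd.concat adds the run label as an outer index level. The names stacked and flat are only illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Rebuild the question's test data: every cell of run i holds the value i.
data = np.arange(5, dtype=float)[:, None, None] * np.ones((5, 7, 4))
simulation_matrix = xr.DataArray(
    data,
    coords={"run": range(5), "year": range(7),
            "simdata": ["val1", "val2", "val3", "val4"]},
    dims=("run", "year", "simdata"),
)

runs = simulation_matrix["run"].values.tolist()

# One (year x simdata) frame per run; `keys` labels each block with its run.
stacked = pd.concat(
    [simulation_matrix.sel(run=r).to_pandas() for r in runs],
    keys=runs,
    names=["run"],
)
print(stacked.shape)  # (35, 4)

# Turn the (run, year) index into ordinary columns before exporting.
flat = stacked.reset_index()
print(flat.head())
```

Calling flat.to_csv(...) with index=False should then write one row per run and year, with the run number in its own column.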
You can use .to_dataframe and then unstack; you just need to pass a name for the data (it becomes the name of the column that holds the values):
In [41]: simulation_matrix.to_dataframe("results").unstack()
Out[41]:
results
simdata val1 val2 val3 val4
run year
0 0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0
1 0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 1.0 1.0 1.0 1.0
2 0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0
4 2.0 2.0 2.0 2.0
5 2.0 2.0 2.0 2.0
6 2.0 2.0 2.0 2.0
3 0 3.0 3.0 3.0 3.0
1 3.0 3.0 3.0 3.0
2 3.0 3.0 3.0 3.0
3 3.0 3.0 3.0 3.0
4 3.0 3.0 3.0 3.0
5 3.0 3.0 3.0 3.0
6 3.0 3.0 3.0 3.0
4 0 4.0 4.0 4.0 4.0
1 4.0 4.0 4.0 4.0
2 4.0 4.0 4.0 4.0
3 4.0 4.0 4.0 4.0
4 4.0 4.0 4.0 4.0
5 4.0 4.0 4.0 4.0
6 4.0 4.0 4.0 4.0
All the "run" values are there, even though the default representation shows only the first in each repeated group for conciseness:
In [50]: df = simulation_matrix.to_dataframe("results").unstack()
In [51]: df.reset_index().head()
Out[51]:
run year results
simdata val1 val2 val3 val4
0 0 0 0.0 0.0 0.0 0.0
1 0 1 0.0 0.0 0.0 0.0
2 0 2 0.0 0.0 0.0 0.0
3 0 3 0.0 0.0 0.0 0.0
4 0 4 0.0 0.0 0.0 0.0
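Since the stated goal is a CSV file, here is a hedged sketch of that last step, assuming the to_dataframe("results").unstack() result shown above. Selecting the top-level "results" column drops that extra header level, so the CSV gets a single header row. The io.StringIO buffer is only a stand-in for a real file path, and the name flat is illustrative.

```python
import io

import numpy as np
import xarray as xr

# Rebuild the question's test data: every cell of run i holds the value i.
data = np.arange(5, dtype=float)[:, None, None] * np.ones((5, 7, 4))
simulation_matrix = xr.DataArray(
    data,
    coords={"run": range(5), "year": range(7),
            "simdata": ["val1", "val2", "val3", "val4"]},
    dims=("run", "year", "simdata"),
)

df = simulation_matrix.to_dataframe("results").unstack()

# Picking the "results" block leaves plain val1..val4 columns;
# reset_index then moves run and year out of the index into columns.
flat = df["results"].reset_index()
flat.columns.name = None  # drop the leftover "simdata" label on the column axis

buffer = io.StringIO()  # stand-in for a path such as "results.csv"
flat.to_csv(buffer, index=False)
print(buffer.getvalue().splitlines()[0])  # run,year,val1,val2,val3,val4
```

Keeping run and year as regular columns means the file can be read back and regrouped by run without relying on row order.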