Python Xarray: how to convert a 3-d DataArray to a 2-d stacked Pandas dataframe

I have a 3d xarray DataArray holding time series data for multiple runs of a model. The rows are indexed by the simulation timestep, the columns are a variety of variables captured about the model, and the depth coordinate represents the individual simulation run, since I run the entire simulation multiple times.

My goal is to take this 3d xarray DataArray and convert it to a 2d pandas dataframe so that I can export it to a CSV file. In order to do that, I need to stack the simulation runs on top of each other, so that the 3d array becomes a 2d array.

I have some code to generate some test data, but I am not familiar enough with Xarray to know how to do this kind of stacking.

Here is some code to generate test data.

import xarray as xr
import pandas as pd
import numpy as np
from tqdm import tqdm

# 5 runs x 7 years x 4 recorded variables, initialised to zero
results_matrix = np.zeros([5, 7, 4])
simulation_matrix = xr.DataArray(results_matrix,
                                 coords={'simdata': ['val1', 'val2', 'val3', 'val4'],
                                         'run': range(5),
                                         'year': range(7)},
                                 dims=('run', 'year', 'simdata'))

# Fill every cell of run i with the value i so the runs are easy to tell apart
for i in tqdm(range(5)):
    simulation_matrix[i, :, :] = i

This code will generate a DataArray that looks like this:

<xarray.DataArray (run: 5, year: 7, simdata: 4)>
array([[[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],
 ... Additional arrays truncated

I want this converted to a 2d Pandas dataframe, something like:

        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
 ... Additional rows truncated

Any suggestions?

UPDATED:

Based upon comments from @rahlf23 and @DSM, I had some luck with simulation_matrix.to_dataframe('fred').unstack():

         fred
simdata  val1 val2 val3 val4
run year
0   0     0.0  0.0  0.0  0.0
    1     0.0  0.0  0.0  0.0
    2     0.0  0.0  0.0  0.0
    3     0.0  0.0  0.0  0.0
    4     0.0  0.0  0.0  0.0
    5     0.0  0.0  0.0  0.0
    6     0.0  0.0  0.0  0.0
1   0     1.0  1.0  1.0  1.0
    1     1.0  1.0  1.0  1.0
    2     1.0  1.0  1.0  1.0
    3     1.0  1.0  1.0  1.0
    4     1.0  1.0  1.0  1.0
    5     1.0  1.0  1.0  1.0
    6     1.0  1.0  1.0  1.0

Using your test data, you can use to_pandas() and pd.concat(), iterating over the labels of the run coordinate:

df = pd.concat([simulation_matrix.loc[i, :, :].to_pandas() for i in simulation_matrix['run'].values])

Yields:

simdata  val1  val2  val3  val4
year                           
0         0.0   0.0   0.0   0.0
1         0.0   0.0   0.0   0.0
2         0.0   0.0   0.0   0.0
3         0.0   0.0   0.0   0.0
4         0.0   0.0   0.0   0.0
5         0.0   0.0   0.0   0.0
6         0.0   0.0   0.0   0.0
0         1.0   1.0   1.0   1.0
1         1.0   1.0   1.0   1.0
2         1.0   1.0   1.0   1.0
3         1.0   1.0   1.0   1.0
4         1.0   1.0   1.0   1.0
5         1.0   1.0   1.0   1.0
6         1.0   1.0   1.0   1.0
0         2.0   2.0   2.0   2.0
1         2.0   2.0   2.0   2.0
2         2.0   2.0   2.0   2.0
3         2.0   2.0   2.0   2.0
4         2.0   2.0   2.0   2.0
5         2.0   2.0   2.0   2.0
6         2.0   2.0   2.0   2.0
0         3.0   3.0   3.0   3.0
1         3.0   3.0   3.0   3.0
2         3.0   3.0   3.0   3.0
3         3.0   3.0   3.0   3.0
4         3.0   3.0   3.0   3.0
5         3.0   3.0   3.0   3.0
6         3.0   3.0   3.0   3.0
0         4.0   4.0   4.0   4.0
1         4.0   4.0   4.0   4.0
2         4.0   4.0   4.0   4.0
3         4.0   4.0   4.0   4.0
4         4.0   4.0   4.0   4.0
5         4.0   4.0   4.0   4.0
6         4.0   4.0   4.0   4.0
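
The concatenated frame above is indexed by year only, so once it is written out, rows from different runs can no longer be told apart. A minimal sketch of one way to keep the run label, assuming the same test data as in the question (rebuilt here so the snippet runs on its own), is to pass keys and names to pd.concat(). The variable names runs, frames and stacked are illustrative only.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Rebuild the question's test data: every cell of run i holds the value i.
data = np.zeros((5, 7, 4)) + np.arange(5)[:, None, None]
simulation_matrix = xr.DataArray(
    data,
    coords={
        "run": np.arange(5),
        "year": np.arange(7),
        "simdata": ["val1", "val2", "val3", "val4"],
    },
    dims=("run", "year", "simdata"),
)

# One 2-d frame (year x simdata) per run, selected by run label.
runs = simulation_matrix["run"].values.tolist()
frames = [simulation_matrix.sel(run=r).to_pandas() for r in runs]

# keys= adds an outer index level carrying the run label of each frame.
stacked = pd.concat(frames, keys=runs, names=["run", "year"])
```

With this, stacked has a (run, year) MultiIndex, and a single row can be looked up with something like stacked.loc[(4, 0)].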

You can use .to_dataframe and then unstack. You just need to pass a name for the data, which becomes the name of the column holding the values:

In [41]: simulation_matrix.to_dataframe("results").unstack()
Out[41]: 
         results               
simdata     val1 val2 val3 val4
run year                       
0   0        0.0  0.0  0.0  0.0
    1        0.0  0.0  0.0  0.0
    2        0.0  0.0  0.0  0.0
    3        0.0  0.0  0.0  0.0
    4        0.0  0.0  0.0  0.0
    5        0.0  0.0  0.0  0.0
    6        0.0  0.0  0.0  0.0
1   0        1.0  1.0  1.0  1.0
    1        1.0  1.0  1.0  1.0
    2        1.0  1.0  1.0  1.0
    3        1.0  1.0  1.0  1.0
    4        1.0  1.0  1.0  1.0
    5        1.0  1.0  1.0  1.0
    6        1.0  1.0  1.0  1.0
2   0        2.0  2.0  2.0  2.0
    1        2.0  2.0  2.0  2.0
    2        2.0  2.0  2.0  2.0
    3        2.0  2.0  2.0  2.0
    4        2.0  2.0  2.0  2.0
    5        2.0  2.0  2.0  2.0
    6        2.0  2.0  2.0  2.0
3   0        3.0  3.0  3.0  3.0
    1        3.0  3.0  3.0  3.0
    2        3.0  3.0  3.0  3.0
    3        3.0  3.0  3.0  3.0
    4        3.0  3.0  3.0  3.0
    5        3.0  3.0  3.0  3.0
    6        3.0  3.0  3.0  3.0
4   0        4.0  4.0  4.0  4.0
    1        4.0  4.0  4.0  4.0
    2        4.0  4.0  4.0  4.0
    3        4.0  4.0  4.0  4.0
    4        4.0  4.0  4.0  4.0
    5        4.0  4.0  4.0  4.0
    6        4.0  4.0  4.0  4.0

All the "run" values are there, even though the default representation shows only the first in each repeated group for conciseness:

In [50]: df = simulation_matrix.to_dataframe("results").unstack()

In [51]: df.reset_index().head()
Out[51]: 
        run year results               
simdata             val1 val2 val3 val4
0         0    0     0.0  0.0  0.0  0.0
1         0    1     0.0  0.0  0.0  0.0
2         0    2     0.0  0.0  0.0  0.0
3         0    3     0.0  0.0  0.0  0.0
4         0    4     0.0  0.0  0.0  0.0
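
Since the stated goal is a CSV file, the two-level column header shown above may be awkward to export. A hedged sketch of one alternative, assuming the same test data as in the question: DataArray.to_series() followed by unstack("simdata") gives single-level columns without having to name the array, and reset_index() turns run and year into ordinary columns. The snippet writes to an in-memory buffer for illustration; a file path could be passed to to_csv() instead.

```python
import io

import numpy as np
import pandas as pd
import xarray as xr

# Rebuild the question's test data: every cell of run i holds the value i.
data = np.zeros((5, 7, 4)) + np.arange(5)[:, None, None]
simulation_matrix = xr.DataArray(
    data,
    coords={
        "run": np.arange(5),
        "year": np.arange(7),
        "simdata": ["val1", "val2", "val3", "val4"],
    },
    dims=("run", "year", "simdata"),
)

# to_series() yields a Series indexed by (run, year, simdata);
# unstacking "simdata" moves the variables into single-level columns.
table = simulation_matrix.to_series().unstack("simdata")

# Turn the (run, year) index into plain columns so each CSV row is self-describing.
flat = table.reset_index()

# Write to an in-memory buffer here; a filename works the same way.
buffer = io.StringIO()
flat.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

The resulting text has one header line followed by one line per (run, year) pair, with run and year as the first two fields.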
