How to subdivide/refine a dimension in an xarray Dataset?

Question

Summary: I have a dataset that is collected in such a way that the dimensions are not initially available. I would like to take what is essentially a big block of undifferentiated data and add dimensions to it so that it can be queried, subsetted, etc. That is the core of the following question.

Here is an xarray Dataset that I have:

<xarray.Dataset>
Dimensions:  (chain: 1, draw: 2000, rows: 24000)
Coordinates:
  * chain    (chain) int64 0
  * draw     (draw) int64 0 1 2 3 4 5 6 7 ... 1993 1994 1995 1996 1997 1998 1999
  * rows     (rows) int64 0 1 2 3 4 5 6 ... 23994 23995 23996 23997 23998 23999
Data variables:
    obs      (chain, draw, rows) float64 4.304 3.985 4.612 ... 6.343 5.538 6.475
Attributes:
    created_at:                 2019-12-27T17:16:13.847972
    inference_library:          pymc3
    inference_library_version:  3.8

The rows dimension here corresponds to a number of subdimensions that I need to restore to the data. In particular, the 24,000 rows correspond to 100 samples each from 240 conditions (the 100 samples for each condition form a contiguous block). These conditions are combinations of gate, input, growth medium, and od.

I would like to end up with something like this:

<xarray.Dataset>
Dimensions:  (chain: 1, draw: 2000, gate: 1, input: 4, growth_medium: 3, sample: 100, rows: 24000)
Coordinates:
  * chain    (chain) int64 0
  * draw     (draw) int64 0 1 2 3 4 5 6 7 ... 1993 1994 1995 1996 1997 1998 1999
  * rows     *MultiIndex*
  * gate     (gate) int64 'AND'
  * input    (input) int64 '00', '01', '10', '11'
  * growth_medium (growth_medium) 'standard', 'rich', 'slow'
  * sample   (sample) int64 0 1 2 3 4 5 6 7 ... 95 96 97 98 99
Data variables:
    obs      (chain, draw, gate, input, growth_medium, samples) float64 4.304 3.985 4.612 ... 6.343 5.538 6.475
Attributes:
    created_at:                 2019-12-27T17:16:13.847972
    inference_library:          pymc3
    inference_library_version:  3.8

I have a pandas DataFrame that specifies the values of gate, input, and growth medium. Each row gives one set of values of gate, input, and growth medium, plus an index that specifies where (in rows) the corresponding block of 100 samples appears. The intent is that this DataFrame is a guide for labeling the Dataset.

I looked at the xarray docs on "Reshaping and Reorganizing Data", but I don't see how to combine those operations to do what I need. I suspect I somehow need to combine these with GroupBy, but I don't see how. Thanks!

Later: I have a solution to this problem, but it is so disgusting that I am hoping someone will explain how wrong I am, and what more elegant approach is possible.

So, first, I extracted all the data in the original Dataset into raw numpy form:

foo = qm.idata.posterior_predictive['obs'].squeeze('chain').values.T
foo.shape # (24000, 2000)

Then I reshaped it as needed:

bar = np.reshape(foo, (240, 100, 2000))

This gives me roughly the shape I want: there are 240 different experimental conditions, each has 100 variants, and for each of these variants, I have 2000 Monte Carlo samples in my data set.

Now, I extract the information about the 240 experimental conditions from the pandas DataFrame:
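The reshape above is only valid because the samples for each condition are contiguous in rows. A minimal sketch with made-up sizes (3 conditions, 4 samples, 2 draws instead of 240, 100, 2000) shows that a C-order reshape keeps each contiguous block together:

```python
import numpy as np

# Toy sizes standing in for 240 conditions x 100 samples x 2000 draws.
n_conditions, n_samples, n_draws = 3, 4, 2

# Same layout as `foo` in the question: (rows, draws), rows grouped by condition.
foo = np.arange(n_conditions * n_samples * n_draws).reshape(
    n_conditions * n_samples, n_draws
)

# Split the leading axis into (condition, sample); the draw axis is untouched.
bar = np.reshape(foo, (n_conditions, n_samples, n_draws))

# Condition 1 should contain exactly original rows 4..7.
block_matches = np.array_equal(bar[1], foo[4:8])
```

If the samples were interleaved rather than contiguous, the target shape would need the two leading axes swapped, followed by a transpose.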

import pandas as pd
# qdf is the original dataframe with the experimental conditions and some
# extraneous information in other columns
new_df = qdf[['gate', 'input', 'output', 'media', 'od_lb', 'od_ub', 'temperature']]
idx = pd.MultiIndex.from_frame(new_df)

Finally, I reassembled a DataArray from the numpy array and the pandas MultiIndex:

xr.DataArray(bar, name='obs', dims=['regions', 'conditions', 'draws'],
             coords={'regions': idx, 'conditions': range(100), 'draws': range(2000)})

The resulting DataArray has these coordinates, as I wished:

Coordinates:
  * regions      (regions) MultiIndex
  - gate         (regions) object 'AND' 'AND' 'AND' 'AND' ... 'AND' 'AND' 'AND'
  - input        (regions) object '00' '10' '10' '10' ... '01' '01' '11' '11'
  - output       (regions) object '0' '0' '0' '0' '0' ... '0' '0' '0' '1' '1'
  - media        (regions) object 'standard_media' ... 'high_osm_media_five_percent'
  - od_lb        (regions) float64 0.0 0.001 0.001 ... 0.0001 0.0051 0.0051
  - od_ub        (regions) float64 0.0001 0.0051 0.0051 2.0 ... 0.0003 2.0 2.0
  - temperature  (regions) int64 30 30 37 30 37 30 37 ... 37 30 37 30 37 30 37
  * conditions   (conditions) int64 0 1 2 3 4 5 6 7 ... 92 93 94 95 96 97 98 99
  * draws        (draws) int64 0 1 2 3 4 5 6 ... 1994 1995 1996 1997 1998 1999

That was pretty horrible, though, and it seems wrong that I had to punch through all the nice layers of xarray abstraction to get to this point. Especially since this does not seem like an unusual piece of a scientific workflow: getting a relatively raw data set together with a spreadsheet of metadata that needs to be combined with the data. So what am I doing wrong? What's the more elegant solution?

Answer 1

Given a starting Dataset similar to:

<xarray.Dataset>
Dimensions:  (draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * row      (row) int32 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47
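For anyone who wants to follow along, a Dataset of this shape can be built roughly as follows. This is a sketch; the exact integer dtype shown in the repr (int32 vs int64) depends on the platform:

```python
import numpy as np
import xarray as xr

# Toy stand-in for the real data: 2 draws, 24 flat rows, values 0..47.
ds = xr.Dataset(
    {"obs": (("draw", "row"), np.arange(48).reshape(2, 24))},
    coords={"draw": np.arange(2), "row": np.arange(24)},
)
```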

You can chain several pure xarray commands to subdivide the dimensions (keeping the data in the same shape but indexing it with a multiindex) or even reshape the Dataset. To subdivide the dimensions, the following code can be used:

multiindex_ds = ds.assign_coords(
    dim_0=["a", "b", "c"], dim_1=[0,1], dim_2=range(4)
).stack(
    dim=("dim_0", "dim_1", "dim_2")
).reset_index(
    "row", drop=True
).rename(
    row="dim"
)
multiindex_ds

whose output is:

<xarray.Dataset>
Dimensions:  (dim: 24, draw: 2)
Coordinates:
  * draw     (draw) int32 0 1
  * dim      (dim) MultiIndex
  - dim_0    (dim) object 'a' 'a' 'a' 'a' 'a' 'a' ... 'c' 'c' 'c' 'c' 'c' 'c'
  - dim_1    (dim) int64 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
  - dim_2    (dim) int64 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Data variables:
    obs      (draw, dim) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

Moreover, the multiindex can then be unstacked, effectively reshaping the Dataset:

reshaped_ds = multiindex_ds.unstack("dim")
reshaped_ds

with output:

<xarray.Dataset>
Dimensions:  (dim_0: 3, dim_1: 2, dim_2: 4, draw: 2)
Coordinates:
  * draw     (draw) int32 0 1
  * dim_0    (dim_0) object 'a' 'b' 'c'
  * dim_1    (dim_1) int64 0 1
  * dim_2    (dim_2) int64 0 1 2 3
Data variables:
    obs      (draw, dim_0, dim_1, dim_2) int32 0 1 2 3 4 5 ... 42 43 44 45 46 47

I think that this alone does not completely cover your needs, because you want to convert one dimension into two dimensions, one of which is to be a multiindex. All the building blocks are here, though.

For example, you can follow these steps (including unstacking) with regions and conditions, and then follow these steps again (without unstacking this time) to convert regions to a multiindex. Another option would be to use all dimensions from the start, unstack them, and then stack them again, leaving conditions outside of the final multiindex.

Detailed answer

The answer combines several quite unrelated commands, and it might be tricky to see what each of them is doing.
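The first of these two routes might look roughly like the sketch below. It swaps in a different mechanism than the chain above: a pandas MultiIndex is attached to the flat dimension directly, which is then unstacked. The helper `assign_multiindex`, the toy sizes, and the two-column metadata frame are all made up for illustration. The `xr.Coordinates` branch assumes a recent xarray release, and older releases fall back to passing the MultiIndex to `assign_coords` directly.

```python
import numpy as np
import pandas as pd
import xarray as xr


def assign_multiindex(ds, midx, dim):
    """Attach a pandas MultiIndex to dimension `dim` (hypothetical helper)."""
    if hasattr(xr, "Coordinates"):  # newer xarray releases
        return ds.assign_coords(xr.Coordinates.from_pandas_multiindex(midx, dim))
    return ds.assign_coords({dim: midx})  # older releases


# Toy sizes: 4 regions x 3 samples in contiguous blocks, 2 draws.
n_regions, n_samples, n_draws = 4, 3, 2
ds = xr.Dataset(
    {"obs": (("draw", "rows"),
             np.arange(n_draws * n_regions * n_samples).reshape(n_draws, -1))}
)
# Hypothetical metadata: one row per region, in the same order as the blocks.
qdf = pd.DataFrame({"gate": ["AND"] * 4, "input": ["00", "01", "10", "11"]})

# Step 1: split rows into (region, sample) and unstack into two dimensions.
block_index = pd.MultiIndex.from_product(
    [range(n_regions), range(n_samples)], names=["region", "sample"]
)
split = assign_multiindex(ds, block_index, "rows").unstack("rows")

# Step 2: replace the integer region index by a multiindex built from the metadata.
labelled = assign_multiindex(
    split.drop_vars("region"), pd.MultiIndex.from_frame(qdf), "region"
)

# The metadata columns can now be used for selection.
subset = labelled["obs"].sel(gate="AND", input="10")
```

This relies on the metadata rows being in the same order as the blocks of samples. If the DataFrame instead carries an explicit block position, it would need to be sorted by that column first.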

`assign_coords`

The first step is to create new dimensions and coordinates and add them to the Dataset. This is necessary because the next methods need the dimensions and coordinates to already be present in the Dataset.

Stopping right after assign_coords yields the following Dataset:

<xarray.Dataset>
Dimensions:  (dim_0: 3, dim_1: 2, dim_2: 4, draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * row      (row) int32 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
  * dim_0    (dim_0) <U1 'a' 'b' 'c'
  * dim_1    (dim_1) int32 0 1
  * dim_2    (dim_2) int32 0 1 2 3
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

`stack`

The Dataset now contains 3 new dimensions whose sizes multiply to 24 elements (3 × 2 × 4). However, as the data is currently flat with respect to these 24 elements, we have to stack them into a single 24-element multiindex to make the shapes compatible.

I find assign_coords followed by stack the most natural solution. Another possibility, however, would be to generate a multiindex similarly to how it is done in the question and call assign_coords directly with that multiindex, rendering the stack unnecessary.

This step combines all 3 new dimensions into a single one:

<xarray.Dataset>
Dimensions:  (dim: 24, draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * row      (row) int32 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
  * dim      (dim) MultiIndex
  - dim_0    (dim) object 'a' 'a' 'a' 'a' 'a' 'a' ... 'c' 'c' 'c' 'c' 'c' 'c'
  - dim_1    (dim) int64 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
  - dim_2    (dim) int64 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

Note that, as desired, we now have 2 dimensions of size 24.
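The alternative mentioned above, building the multiindex with pandas and handing it to assign_coords, could be sketched as follows. This is an illustration under assumptions rather than the method used in this answer: the `xr.Coordinates` branch assumes a recent xarray release, while older releases accept the MultiIndex in `assign_coords` directly.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same toy Dataset as at the top of the answer.
ds = xr.Dataset(
    {"obs": (("draw", "row"), np.arange(48).reshape(2, 24))},
    coords={"draw": np.arange(2), "row": np.arange(24)},
)

# Build the 24-element multiindex up front instead of stacking.
midx = pd.MultiIndex.from_product(
    [["a", "b", "c"], [0, 1], range(4)], names=["dim_0", "dim_1", "dim_2"]
)

# Drop the integer row index and rename the dimension, then attach the multiindex.
flat = ds.drop_vars("row").rename_dims({"row": "dim"})
if hasattr(xr, "Coordinates"):  # newer xarray releases
    multiindex_ds = flat.assign_coords(
        xr.Coordinates.from_pandas_multiindex(midx, "dim")
    )
else:  # older releases
    multiindex_ds = flat.assign_coords(dim=midx)

# Unstacking gives the reshaped Dataset with one dimension per level.
reshaped_ds = multiindex_ds.unstack("dim")
value = reshaped_ds["obs"].sel(draw=0, dim_0="b", dim_1=1, dim_2=2).item()
```

With this route, the reset_index and rename steps described next are not needed, since the multiindex is attached to the data's own dimension from the start.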

`reset_index`

Now our final dimension is present in the Dataset as a coordinate, and we want this new coordinate to be the one used to index the variable obs. set_index seems like the correct choice; however, each of our coordinates indexes itself (unlike the example in the set_index docs, where x indexes both the x and a coordinates), which means that set_index cannot be used in this particular case. The method to use is reset_index, which removes the coordinate row without removing the dimension row.

In the following output it can be seen that row is now a dimension without coordinates:

<xarray.Dataset>
Dimensions:  (dim: 24, draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * dim      (dim) MultiIndex
  - dim_0    (dim) object 'a' 'a' 'a' 'a' 'a' 'a' ... 'c' 'c' 'c' 'c' 'c' 'c'
  - dim_1    (dim) int64 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
  - dim_2    (dim) int64 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Dimensions without coordinates: row
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

`rename`

The current Dataset is nearly the final one; the only issue is that the obs variable still has the row dimension instead of the desired one, dim. This does not really look like the intended usage of rename, but it can be used to get dim to absorb row, yielding the desired final result (called multiindex_ds above).

Here again, set_index seems to be the method to choose; however, if set_index(row="dim") is used instead of rename(row="dim"), the multiindex is collapsed into an index made of tuples:

<xarray.Dataset>
Dimensions:  (draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * row      (row) object ('a', 0, 0) ('a', 0, 1) ... ('c', 1, 2) ('c', 1, 3)
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

Answer 1: accepted, 2020-01-10 16:44:01
