Transform a Pandas DataFrame with an n-level hierarchical index into an n-D Numpy array

Question

Is there a good way to transform a DataFrame with an n-level index into an n-D Numpy array (a.k.a. n-tensor)?

Example

Suppose I set up a DataFrame like

from pandas import DataFrame, MultiIndex

index = range(2), range(3)
value = range(2 * 3)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))
print(frame)

which outputs

     value
0 0      0
  1      1
  2      2
1 1      4
  2      5

The index is a 2-level hierarchical index. I can extract a 2-D Numpy array from the data using

print(frame.unstack().values)

which outputs

[[  0.   1.   2.]
 [ nan   4.   5.]]

How does this generalize to an n-level index?

Playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, but not to add an axis.

I cannot use e.g. frame.values.reshape(x, y, z), since this would require that the frame contains exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the above example.

Any suggestions are highly appreciated.

Answer 1

Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.

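import numpy as np

# create an empty array of NaN of the right dimensions
# (np.full needs a real tuple; in Python 3, map() returns an iterator)
shape = tuple(map(len, frame.index.levels))
arr = np.full(shape, np.nan)

# fill it using Numpy's advanced indexing
# (the index must be a tuple of arrays, one per axis)
arr[tuple(frame.index.codes)] = frame.values.flat
# ...or in Pandas < 0.24.0, use
# arr[tuple(frame.index.labels)] = frame.values.flat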
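As a minimal end-to-end sketch of the approach above, the following applies it to the same 3-level frame used in the original solution below. It assumes a single column named value and unique index entries; frame["value"].to_numpy() is used here in place of frame.values.flat purely for readability.

```python
import numpy as np
import pandas as pd

# 2 x 2 x 2 grid of index entries with one row dropped, as in the 3-D setup below
frame = pd.DataFrame(
    {"value": range(8)},
    index=pd.MultiIndex.from_product([range(2), range(2), range(2)]),
).drop((1, 0, 1))

# One axis per index level; the levels still list every label after drop()
shape = tuple(len(level) for level in frame.index.levels)
arr = np.full(shape, np.nan)

# codes holds, for each level, the integer position of every row's label,
# so the tuple of code arrays addresses exactly the cells that have data
arr[tuple(frame.index.codes)] = frame["value"].to_numpy()

print(arr)
```

The cell for the dropped entry (1, 0, 1) is never assigned and therefore stays NaN.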

Original solution. Given a setup similar to above, but in 3-D,

from pandas import DataFrame, MultiIndex
from itertools import product

index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)

we have

       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
  1 0      6
    1      7

Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.

First, reindex the data frame with the full cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be both slow and memory-hungry, depending on the number of dimensions and on the size of the data frame.

levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)

which outputs

       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
    1    NaN
  1 0      6
    1      7

Now, reshape() will work as intended.

shape = tuple(map(len, frame.index.levels))
print(frame.values.reshape(shape))

which outputs

[[[  0.   1.]
  [  2.   3.]]

 [[  4.  nan]
  [  6.   7.]]]

The (rather ugly) one-liner is

frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
     .reshape(tuple(map(len, frame.index.levels)))
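To check that the two routes agree, both can be wrapped in small helpers and compared on the 3-D example. This is only a sketch: the function names to_ndarray_by_codes and to_ndarray_by_reindex are hypothetical, the column is assumed to be called value, and index entries are assumed to be unique. The reindex variant builds the full index with MultiIndex.from_product rather than itertools.product, which is a substitution on my part and not what the answer uses.

```python
import numpy as np
import pandas as pd


def to_ndarray_by_codes(frame, column="value"):
    """Allocate a NaN array and fill it at the positions given by the index codes."""
    shape = tuple(len(level) for level in frame.index.levels)
    arr = np.full(shape, np.nan)
    arr[tuple(frame.index.codes)] = frame[column].to_numpy()
    return arr


def to_ndarray_by_reindex(frame, column="value"):
    """Reindex onto the full cartesian product of the levels, then reshape."""
    shape = tuple(len(level) for level in frame.index.levels)
    full_index = pd.MultiIndex.from_product(frame.index.levels)
    return frame[column].reindex(full_index).to_numpy(dtype=float).reshape(shape)


frame = pd.DataFrame(
    {"value": range(8)},
    index=pd.MultiIndex.from_product([range(2), range(2), range(2)]),
).drop((1, 0, 1))

a = to_ndarray_by_codes(frame)
b = to_ndarray_by_reindex(frame)

# equal_nan=True treats the NaN at the dropped position as matching
print(np.allclose(a, b, equal_nan=True))
```

The codes-based helper touches only the rows that exist, whereas the reindex-based one materializes every combination of labels first, which is where the extra time and memory mentioned above go.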

Answer 1
14 Accepted 2016-01-27 23:08:40