Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array
Is there a good way to transform a DataFrame with an n-level index into an n-D NumPy array (a.k.a. an n-tensor)?
Suppose I set up a DataFrame like
from pandas import DataFrame, MultiIndex
index = range(2), range(3)
value = range(2 * 3)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))
print(frame)
which outputs
     value
0 0      0
  1      1
  2      2
1 1      4
  2      5
The index is a 2-level hierarchical index. I can extract a 2-D NumPy array from the data using
print(frame.unstack().values)
which outputs
[[  0.   1.   2.]
 [ nan   4.   5.]]
How does this generalize to an n-level index?
Playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, but not to add an axis.
I cannot use e.g. frame.values.reshape(x, y, z), since this would require that the frame contains exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the above example.
Any suggestions are highly appreciated.
Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.
import numpy as np
# create an empty array of NaN of the right dimensions
shape = tuple(map(len, frame.index.levels))
arr = np.full(shape, np.nan)
# fill it using Numpy's advanced indexing; the codes must be passed as a tuple
arr[tuple(frame.index.codes)] = frame.values.flat
# ...or in Pandas < 0.24.0, use
# arr[tuple(frame.index.labels)] = frame.values.flat
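For reference, here is a minimal self-contained sketch of the approach above, run on the 2-D frame from the question. The helper name frame_to_ndarray and its column parameter are made up for illustration; they are not part of pandas. It assumes the index has no duplicate entries (with duplicates, a later row would overwrite an earlier one).

```python
import numpy as np
import pandas as pd


def frame_to_ndarray(frame, column="value"):
    """Scatter one column of a MultiIndexed frame into a NaN-filled n-D array.

    Hypothetical helper for illustration only.
    """
    index = frame.index
    # One axis per index level; its length is the number of labels in that level.
    shape = tuple(len(level) for level in index.levels)
    arr = np.full(shape, np.nan)
    # index.codes holds, per level, the integer position of each row's label.
    # Passing them as a tuple addresses exactly one cell per row.
    arr[tuple(index.codes)] = frame[column].to_numpy()
    return arr


index = pd.MultiIndex.from_product([range(2), range(3)])
frame = pd.DataFrame({"value": list(range(6))}, index=index).drop((1, 0))
result = frame_to_ndarray(frame)
print(result)
```

Note that index.levels can still list labels that no remaining row uses (for instance after filtering), which shows up as all-NaN slices in the result. If that is unwanted, calling frame.index.remove_unused_levels() first should trim them.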
Original solution. Given a setup similar to above, but in 3-D,
from pandas import DataFrame, MultiIndex
from itertools import product
index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)
we have
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
  1 0      6
    1      7
Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.
First, reindex the data frame with the full cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be both slow and consume a lot of memory, depending on the number of dimensions and on the size of the data frame.
levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)
which outputs
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
    1    NaN
  1 0      6
    1      7
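As a possible variation on the reindexing step, the full cartesian product can also be built with MultiIndex.from_product directly from the index levels, which avoids materializing a list of tuples. This is only a sketch under the same 3-D setup; the names full_index and dense are chosen here for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product([range(2), range(2), range(2)])
frame = pd.DataFrame({"value": list(range(8))}, index=index).drop((1, 0, 1))

# Build the full cartesian product of the level labels as a MultiIndex.
full_index = pd.MultiIndex.from_product(frame.index.levels,
                                        names=frame.index.names)

# Rows missing from the original frame come back filled with NaN.
dense = frame.reindex(full_index)
print(dense)
```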
Now, reshape() will work as intended.
shape = tuple(map(len, frame.index.levels))
print(frame.values.reshape(shape))
which outputs
[[[  0.   1.]
  [  2.   3.]]

 [[  4.  nan]
  [  6.   7.]]]
The (rather ugly) one-liner is
frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
     .reshape(tuple(map(len, frame.index.levels)))
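As a quick sanity check, the reindex-and-reshape route and the index-codes route should produce the same array on the 3-D example. The sketch below assumes a reasonably recent pandas (0.24 or later, for index.codes and to_numpy) and uses MultiIndex.from_product in place of the itertools.product call; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([range(2), range(2), range(2)])
frame = pd.DataFrame({"value": list(range(8))}, index=index).drop((1, 0, 1))
shape = tuple(len(level) for level in frame.index.levels)

# Route 1: reindex over the full cartesian product, then reshape.
full_index = pd.MultiIndex.from_product(frame.index.levels)
via_reshape = frame.reindex(full_index)["value"].to_numpy().reshape(shape)

# Route 2: scatter the values into a NaN-filled array via the index codes.
via_codes = np.full(shape, np.nan)
via_codes[tuple(frame.index.codes)] = frame["value"].to_numpy()

# NaN never compares equal to itself, so compare with equal_nan=True.
same = np.allclose(via_reshape, via_codes, equal_nan=True)
print(same)
```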