Converting a numpy array with indices to a pandas DataFrame

Question

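I have a numpy array that I want to plot with the tile geom of python ggplot. For that, I need a DataFrame with the columns x, y, value. How can I efficiently convert a numpy array into such a DataFrame? Note that the form of data I want is sparse-style, but I want a regular DataFrame. I tried using sparse data structures, as in "Convert sparse matrix (csc_matrix) to pandas dataframe", but the conversion was too slow and memory-hungry: I ran out of memory.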

To clarify what I want:

I start with a numpy array like

array([[ 1,  3,  7],
       [ 4,  9,  8]])

and I want to end up with a DataFrame like

     x    y    value
0    0    0    1
1    0    1    3
2    0    2    7
3    1    0    4
4    1    1    9
5    1    2    8

Answer 1

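import numpy as np
import pandas as pd

arr = np.array([[1, 3, 7],
                [4, 9, 8]])

# Flatten the row/column index grids to (N, 2), then append the values as a third column.
idx = np.indices(arr.shape).reshape(2, arr.size).T
df = pd.DataFrame(np.hstack((idx, arr.reshape(-1, 1))),
                  columns=['x', 'y', 'value'])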
print(df)

   x  y  value
0  0  0      1
1  0  1      3
2  0  2      7
3  1  0      4
4  1  1      9
5  1  2      8
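
One thing to keep in mind with the approach above: np.hstack puts the indices and the values into a single array, so all three columns share one dtype (for a float array, x and y come out as floats too). If that matters, a minimal sketch that builds each column separately is shown below. The helper name array_to_long_frame is hypothetical and not part of the original answer; it assumes a 2-D input.

```python
import numpy as np
import pandas as pd

def array_to_long_frame(arr):
    """Hypothetical helper: one row per cell, integer x/y, values keep their dtype."""
    m, n = arr.shape  # assumes a 2-D array
    return pd.DataFrame({
        "x": np.repeat(np.arange(m), n),  # row index, each repeated n times
        "y": np.tile(np.arange(n), m),    # column index, cycled m times
        "value": arr.ravel(),             # values in row-major order
    })

arr = np.array([[1.5, 3.0, 7.0],
                [4.0, 9.0, 8.0]])
df = array_to_long_frame(arr)
print(df.dtypes)
```

Because the columns are separate arrays, pandas stores x and y as integers and value as float, at the cost of not being a single contiguous block.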

You could also consider using the function from this answer as a faster alternative to np.indices in the solution above:

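def indices_merged_arr(arr):
    m, n = arr.shape
    # Open grids: I has shape (m, 1), J has shape (1, n); both broadcast on assignment.
    I, J = np.ogrid[:m, :n]
    # One (m, n, 3) buffer holding (row, column, value) for every cell.
    out = np.empty((m, n, 3), dtype=arr.dtype)
    out[..., 0] = I
    out[..., 1] = J
    out[..., 2] = arr
    # Flatten to one (row, column, value) triple per row.
    return out.reshape(-1, 3)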

array = np.array([[ 1,  3,  7],
                  [ 4,  9,  8]])

df = pd.DataFrame(indices_merged_arr(array), columns=['x', 'y', 'value'])
print(df)

   x  y  value
0  0  0      1
1  0  1      3
2  0  2      7
3  1  0      4
4  1  1      9
5  1  2      8

Performance

arr = np.random.randn(1000, 1000)

%timeit df = pd.DataFrame(np.hstack((np.indices(arr.shape).reshape(2, arr.size).T,\
                         arr.reshape(-1, 1))), columns=['x', 'y', 'value'])
100 loops, best of 3: 15.3 ms per loop

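%timeit pd.DataFrame(indices_merged_arr(array), columns=['x', 'y', 'value'])
1000 loops, best of 3: 229 µs per loop

Note that this second timing runs on array, the 2x3 example above, rather than on the 1000x1000 arr, so the two results are not directly comparable.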
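
To compare both approaches on the same input outside IPython, here is a minimal sketch using the standard timeit module. The wrapper name with_indices and the repeat count are assumptions made for illustration; no timings are quoted because they depend on the machine.

```python
import timeit

import numpy as np
import pandas as pd

COLS = ['x', 'y', 'value']

def with_indices(arr):
    # Hypothetical wrapper around the np.indices + np.hstack approach.
    idx = np.indices(arr.shape).reshape(2, arr.size).T
    return pd.DataFrame(np.hstack((idx, arr.reshape(-1, 1))), columns=COLS)

def indices_merged_arr(arr):
    # Same function as in the answer, using reshape for the final step.
    m, n = arr.shape
    I, J = np.ogrid[:m, :n]
    out = np.empty((m, n, 3), dtype=arr.dtype)
    out[..., 0] = I
    out[..., 1] = J
    out[..., 2] = arr
    return out.reshape(-1, 3)

arr = np.random.randn(1000, 1000)

# Time both functions on the same 1000x1000 array.
t_indices = timeit.timeit(lambda: with_indices(arr), number=5) / 5
t_merged = timeit.timeit(
    lambda: pd.DataFrame(indices_merged_arr(arr), columns=COLS), number=5) / 5

print(f"np.indices + hstack: {t_indices * 1e3:.2f} ms per call")
print(f"indices_merged_arr:  {t_merged * 1e3:.2f} ms per call")
```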
