How to smartly convert a Pandas DataFrame into an array of arrays?

Take everything below as an example.

I have the following pandas DataFrame:

name    col1    col2    col3    col4    col5    col6    col7
----    ----    ----    ----    ----    ----    ----    ----
doc1     100       1       1       1       1       1       1
doc1     200       2       2       2       2       2       2
doc1     300       3       3       3       3       3       3
doc2     100      11      11      11      11      11      11
doc2     200      21      21      21      21      21      21
doc2     300      31      31      31      31      31      31
doc2     300      31      31      31      31      31      31
doc3     100      12      12      12      12      12      12
doc3     100      12      12      12      12      12      12
doc3     200      22      22      22      22      22      22
doc3     300      32      32      32      32      32      32
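
For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: the helper list rows is a name made up here, and it relies on col2 through col7 holding the same value in every row, as they do in the example.

```python
import pandas as pd

# (name, col1, value repeated in col2..col7), copied from the table above
rows = [
    ("doc1", 100, 1), ("doc1", 200, 2), ("doc1", 300, 3),
    ("doc2", 100, 11), ("doc2", 200, 21), ("doc2", 300, 31), ("doc2", 300, 31),
    ("doc3", 100, 12), ("doc3", 100, 12), ("doc3", 200, 22), ("doc3", 300, 32),
]

df = pd.DataFrame(
    [(name, col1, *[v] * 6) for name, col1, v in rows],
    columns=["name"] + [f"col{i}" for i in range(1, 8)],
)

print(df.shape)  # (11, 8)
```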

The name column is the one that should be used to group the data.

Now, I need to convert all the data in the colX columns of each docX into an array.

In the end, I want an array of arrays.

But each object (individual array) must have 5 rows, so any document with fewer than 5 rows should be padded with 0's.

So for the example above, I would expect to get the following:

data = [
    [
        [100,  1,    1,    1,    1,    1,    1],
        [200,  2,    2,    2,    2,    2,    2],
        [300,  3,    3,    3,    3,    3,    3],
        [  0,  0,    0,    0,    0,    0,    0],
        [  0,  0,    0,    0,    0,    0,    0]
    ],
    [
        [100,   11,   11,   11,   11,   11,   11],
        [200,   21,   21,   21,   21,   21,   21],
        [300,   31,   31,   31,   31,   31,   31],
        [300,   31,   31,   31,   31,   31,   31],
        [  0,    0,    0,    0,    0,    0,    0]
    ],
    [
        [100,  12,   12,   12,   12,   12,   12],
        [100,  12,   12,   12,   12,   12,   12],
        [200,  22,   22,   22,   22,   22,   22],
        [300,  32,   32,   32,   32,   32,   32],
        [  0,   0,    0,    0,    0,    0,    0]
    ]
]

data.shape == (3, 5, 7)

How can I do it in a smart way?

I'm not sure about smartly, but you can try pivot_table with reindex:

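# number the rows within each document (0, 1, 2, ...) and pivot on that counter
tmp = (df.assign(row=df.groupby('name').cumcount())
   .pivot_table(index=['row'],columns=['name'],fill_value=0)
   .reindex(np.arange(5), fill_value=0).T   # pad every document to 5 rows
   .unstack(level=0).to_numpy()
)
# tmp has one row per document, so len(tmp) is the number of documents
out = tmp.reshape(len(tmp), 5, -1)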

Output:

array([[[100,   1,   1,   1,   1,   1,   1],
        [200,   2,   2,   2,   2,   2,   2],
        [300,   3,   3,   3,   3,   3,   3],
        [  0,   0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,   0,   0]],

       [[100,  11,  11,  11,  11,  11,  11],
        [200,  21,  21,  21,  21,  21,  21],
        [300,  31,  31,  31,  31,  31,  31],
        [300,  31,  31,  31,  31,  31,  31],
        [  0,   0,   0,   0,   0,   0,   0]],

       [[100,  12,  12,  12,  12,  12,  12],
        [100,  12,  12,  12,  12,  12,  12],
        [200,  22,  22,  22,  22,  22,  22],
        [300,  32,  32,  32,  32,  32,  32],
        [  0,   0,   0,   0,   0,   0,   0]]])
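
The same cumcount counter can also be used directly as a row index into a pre-allocated array of zeros, which skips the pivot entirely. This is a different technique from the pivot above, not a rewrite of it, and only a sketch: it assumes no document has more than n_rows rows, and the names n_rows, codes, and row_in_doc are made up here.

```python
import numpy as np
import pandas as pd

# Same data as in the question
rows = [
    ("doc1", 100, 1), ("doc1", 200, 2), ("doc1", 300, 3),
    ("doc2", 100, 11), ("doc2", 200, 21), ("doc2", 300, 31), ("doc2", 300, 31),
    ("doc3", 100, 12), ("doc3", 100, 12), ("doc3", 200, 22), ("doc3", 300, 32),
]
df = pd.DataFrame(
    [(name, col1, *[v] * 6) for name, col1, v in rows],
    columns=["name"] + [f"col{i}" for i in range(1, 8)],
)

n_rows = 5  # assumed fixed number of rows per document

# codes: 0-based document number for every row, in order of first appearance
codes, names = pd.factorize(df["name"])
# row_in_doc: position of every row inside its own document
row_in_doc = df.groupby("name").cumcount().to_numpy()

values = df.drop(columns="name").to_numpy()
out = np.zeros((len(names), n_rows, values.shape[1]), dtype=values.dtype)
out[codes, row_in_doc] = values  # rows that are never assigned stay 0

print(out.shape)  # (3, 5, 7)
```

If a document has more than n_rows rows, the assignment raises an IndexError instead of silently truncating, so that case would need to be handled beforehand.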

Get the positions in the name column where the name changes (from doc1 to doc2 to doc3). These will be used to split the dataframe:

import pandas as pd
import numpy as np

split = df.index[~df.name.eq(df.name.shift())][1:]
split
Int64Index([3, 7], dtype='int64')
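
To see why this works, here is the comparison broken into steps on a bare Series with the same names. It is a sketch under two assumptions: the index is the default 0, 1, 2, ... (the labels are later used as positions), and all rows of a document sit next to each other. The variable names changed and boundaries are made up here.

```python
import pandas as pd

names = pd.Series(["doc1"] * 3 + ["doc2"] * 4 + ["doc3"] * 4)

# True wherever a row's name differs from the row above it;
# the first row is compared with NaN, so it is always True
changed = names.ne(names.shift())

# Drop the first label (0): splitting there would create an empty chunk
boundaries = names.index[changed][1:]

print(list(boundaries))  # [3, 7]
```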

Split the dataframe at the positions in split, using numpy.split:

df_split = np.split(df.iloc[:, 1:].to_numpy(), split)
df_split
[array([[100,   1,   1,   1,   1,   1,   1],
        [200,   2,   2,   2,   2,   2,   2],
        [300,   3,   3,   3,   3,   3,   3]]),
 array([[100,  11,  11,  11,  11,  11,  11],
        [200,  21,  21,  21,  21,  21,  21],
        [300,  31,  31,  31,  31,  31,  31],
        [300,  31,  31,  31,  31,  31,  31]]),
 array([[100,  12,  12,  12,  12,  12,  12],
        [100,  12,  12,  12,  12,  12,  12],
        [200,  22,  22,  22,  22,  22,  22],
        [300,  32,  32,  32,  32,  32,  32]])]

Get the lengths of the individual arrays; value_counts or a list comprehension works fine:

split = [len(arr) for arr in df_split]
split
[3, 4, 4]

Create an array of zeros:

zeros = np.zeros((3, 5, 7))

Finally, fill the zeros with the values from df_split:

zeros[0, : split[0]] = df_split[0]
zeros[1, : split[1]] = df_split[1]
zeros[2, : split[2]] = df_split[2]
zeros

array([[[100.,   1.,   1.,   1.,   1.,   1.,   1.],
        [200.,   2.,   2.,   2.,   2.,   2.,   2.],
        [300.,   3.,   3.,   3.,   3.,   3.,   3.],
        [  0.,   0.,   0.,   0.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   0.,   0.,   0.,   0.]],

       [[100.,  11.,  11.,  11.,  11.,  11.,  11.],
        [200.,  21.,  21.,  21.,  21.,  21.,  21.],
        [300.,  31.,  31.,  31.,  31.,  31.,  31.],
        [300.,  31.,  31.,  31.,  31.,  31.,  31.],
        [  0.,   0.,   0.,   0.,   0.,   0.,   0.]],

       [[100.,  12.,  12.,  12.,  12.,  12.,  12.],
        [100.,  12.,  12.,  12.,  12.,  12.,  12.],
        [200.,  22.,  22.,  22.,  22.,  22.,  22.],
        [300.,  32.,  32.,  32.,  32.,  32.,  32.],
        [  0.,   0.,   0.,   0.,   0.,   0.,   0.]]])
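
With more than a handful of documents, the three hard-coded assignments can be written as a loop. Below is a minimal sketch of that idea; the function pad_groups and its parameters key and n_rows are hypothetical names, and it splits with groupby rather than numpy.split, so the rows of a document do not have to be adjacent.

```python
import numpy as np
import pandas as pd

def pad_groups(df, key="name", n_rows=5):
    """Stack each group's rows into one (n_groups, n_rows, n_cols) array, zero-padded."""
    # sort=False keeps the documents in order of first appearance
    blocks = [g.drop(columns=key).to_numpy() for _, g in df.groupby(key, sort=False)]
    out = np.zeros((len(blocks), n_rows, df.shape[1] - 1), dtype=blocks[0].dtype)
    for i, block in enumerate(blocks):
        # raises ValueError if a group has more than n_rows rows
        out[i, :len(block)] = block
    return out

# Same data as in the question
rows = [
    ("doc1", 100, 1), ("doc1", 200, 2), ("doc1", 300, 3),
    ("doc2", 100, 11), ("doc2", 200, 21), ("doc2", 300, 31), ("doc2", 300, 31),
    ("doc3", 100, 12), ("doc3", 100, 12), ("doc3", 200, 22), ("doc3", 300, 32),
]
df = pd.DataFrame(
    [(name, col1, *[v] * 6) for name, col1, v in rows],
    columns=["name"] + [f"col{i}" for i in range(1, 8)],
)

data = pad_groups(df)
print(data.shape)  # (3, 5, 7)
```

Taking the dtype from the first block keeps the integers as integers, whereas a plain np.zeros((3, 5, 7)) gives floats, as in the output above.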
