
How can I smartly convert a Pandas DataFrame into an array of arrays?

Take everything below as an example.

I have the following pandas dataframe:

name    col1    col2    col3    col4    col5    col6    col7
----    ----    ----    ----    ----    ----    ----    ----
doc1     100       1       1       1       1       1       1
doc1     200       2       2       2       2       2       2
doc1     300       3       3       3       3       3       3
doc2     100      11      11      11      11      11      11
doc2     200      21      21      21      21      21      21
doc2     300      31      31      31      31      31      31
doc2     300      31      31      31      31      31      31
doc3     100      12      12      12      12      12      12
doc3     100      12      12      12      12      12      12
doc3     200      22      22      22      22      22      22
doc3     300      32      32      32      32      32      32

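For reference, the frame above can be rebuilt with a short snippet (in this example the values of col3 through col7 simply mirror col2):

```python
import pandas as pd

# reconstruction of the example frame shown above
df = pd.DataFrame(
    {
        "name": ["doc1"] * 3 + ["doc2"] * 4 + ["doc3"] * 4,
        "col1": [100, 200, 300, 100, 200, 300, 300, 100, 100, 200, 300],
        "col2": [1, 2, 3, 11, 21, 31, 31, 12, 12, 22, 32],
    }
)
# col3..col7 duplicate col2 in this example
for c in ["col3", "col4", "col5", "col6", "col7"]:
    df[c] = df["col2"]
```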
The name column is the one that should be used to group the data.

Now, I need to convert all the values of the columns colX for a given docX into an array, and end up with an array of arrays.

Each individual array must have 5 rows, so any document that does not have 5 rows should be padded with 0's.

Then in the example above, I would expect to get the following:

data = [
    [
        [100,  1,    1,    1,    1,    1,    1],
        [200,  2,    2,    2,    2,    2,    2],
        [300,  3,    3,    3,    3,    3,    3],
        [  0,  0,    0,    0,    0,    0,    0],
        [  0,  0,    0,    0,    0,    0,    0]
    ],
    [
        [100,   11,   11,   11,   11,   11,   11],
        [200,   21,   21,   21,   21,   21,   21],
        [300,   31,   31,   31,   31,   31,   31],
        [300,   31,   31,   31,   31,   31,   31],
        [  0,    0,    0,    0,    0,    0,    0]
    ],
    [
        [100,  12,   12,   12,   12,   12,   12],
        [100,  12,   12,   12,   12,   12,   12],
        [200,  22,   22,   22,   22,   22,   22],
        [300,  32,   32,   32,   32,   32,   32],
        [  0,   0,    0,    0,    0,    0,    0]
    ]
]

data.shape == (3, 5, 7)

How can I do it in a smart way?

I'm not sure about smartly, but you can try pivot_table with reindex:

import numpy as np

tmp = (df.assign(row=df.groupby('name').cumcount())
   .pivot_table(index=['row'], columns=['name'], fill_value=0)
   .reindex(np.arange(5), fill_value=0).T
   .unstack(level=0).to_numpy()
)
# one block per distinct document name
out = tmp.reshape(df['name'].nunique(), 5, -1)

Output:

array([[[100,   1,   1,   1,   1,   1,   1],
        [200,   2,   2,   2,   2,   2,   2],
        [300,   3,   3,   3,   3,   3,   3],
        [  0,   0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,   0,   0]],

       [[100,  11,  11,  11,  11,  11,  11],
        [200,  21,  21,  21,  21,  21,  21],
        [300,  31,  31,  31,  31,  31,  31],
        [300,  31,  31,  31,  31,  31,  31],
        [  0,   0,   0,   0,   0,   0,   0]],

       [[100,  12,  12,  12,  12,  12,  12],
        [100,  12,  12,  12,  12,  12,  12],
        [200,  22,  22,  22,  22,  22,  22],
        [300,  32,  32,  32,  32,  32,  32],
        [  0,   0,   0,   0,   0,   0,   0]]])
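An alternative sketch (not from the answer above) that skips the pivot entirely: group by name, then copy each group into a pre-allocated zero array. It assumes no document has more than 5 rows:

```python
import numpy as np
import pandas as pd

# rebuild the example frame (col3..col7 mirror col2)
df = pd.DataFrame(
    {
        "name": ["doc1"] * 3 + ["doc2"] * 4 + ["doc3"] * 4,
        "col1": [100, 200, 300, 100, 200, 300, 300, 100, 100, 200, 300],
        "col2": [1, 2, 3, 11, 21, 31, 31, 12, 12, 22, 32],
    }
)
for c in ["col3", "col4", "col5", "col6", "col7"]:
    df[c] = df["col2"]

# one 2-D block per document, in order of first appearance
groups = [g.iloc[:, 1:].to_numpy() for _, g in df.groupby("name", sort=False)]

# pre-allocate zeros, then fill the leading rows of each block
out = np.zeros((len(groups), 5, df.shape[1] - 1), dtype=int)
for i, g in enumerate(groups):
    out[i, : len(g)] = g
```

Unlike the split-on-change approach below, groupby also works when a document's rows are not contiguous.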

Get the positions in the name column where the name changes (from doc1 to doc2 to doc3). This will be used to split the dataframe:

import pandas as pd
import numpy as np

split = df.index[~df.name.eq(df.name.shift())][1:]
split
Int64Index([3, 7], dtype='int64')

Split the dataframe with the split variable, using numpy.split :

df_split = np.split(df.iloc[:, 1:].to_numpy(), split)
df_split
[array([[100,   1,   1,   1,   1,   1,   1],
        [200,   2,   2,   2,   2,   2,   2],
        [300,   3,   3,   3,   3,   3,   3]]),
 array([[100,  11,  11,  11,  11,  11,  11],
        [200,  21,  21,  21,  21,  21,  21],
        [300,  31,  31,  31,  31,  31,  31],
        [300,  31,  31,  31,  31,  31,  31]]),
 array([[100,  12,  12,  12,  12,  12,  12],
        [100,  12,  12,  12,  12,  12,  12],
        [200,  22,  22,  22,  22,  22,  22],
        [300,  32,  32,  32,  32,  32,  32]])]

Get the lengths of the individual arrays (a new variable name, since split already holds the change positions); value_counts or a list comprehension works fine:

lengths = [len(arr) for arr in df_split]
lengths
[3, 4, 4]

Create an array of zeros:

zeros = np.zeros((3, 5, 7))

Finally, fill the zeros with the values from df_split:

zeros[0, : lengths[0]] = df_split[0]
zeros[1, : lengths[1]] = df_split[1]
zeros[2, : lengths[2]] = df_split[2]
zeros

array([[[100.,   1.,   1.,   1.,   1.,   1.,   1.],
        [200.,   2.,   2.,   2.,   2.,   2.,   2.],
        [300.,   3.,   3.,   3.,   3.,   3.,   3.],
        [  0.,   0.,   0.,   0.,   0.,   0.,   0.],
        [  0.,   0.,   0.,   0.,   0.,   0.,   0.]],

       [[100.,  11.,  11.,  11.,  11.,  11.,  11.],
        [200.,  21.,  21.,  21.,  21.,  21.,  21.],
        [300.,  31.,  31.,  31.,  31.,  31.,  31.],
        [300.,  31.,  31.,  31.,  31.,  31.,  31.],
        [  0.,   0.,   0.,   0.,   0.,   0.,   0.]],

       [[100.,  12.,  12.,  12.,  12.,  12.,  12.],
        [100.,  12.,  12.,  12.,  12.,  12.,  12.],
        [200.,  22.,  22.,  22.,  22.,  22.,  22.],
        [300.,  32.,  32.,  32.,  32.,  32.,  32.],
        [  0.,   0.,   0.,   0.,   0.,   0.,   0.]]])
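The steps above can be combined into one small helper (a sketch; the function name and n_rows parameter are my own):

```python
import numpy as np
import pandas as pd


def to_padded_array(df, n_rows=5):
    """Split df on changes in the 'name' column and zero-pad each block to n_rows."""
    # positions where the name differs from the previous row (drop position 0)
    split = df.index[~df["name"].eq(df["name"].shift())][1:]
    # per-document 2-D blocks, excluding the name column
    parts = np.split(df.iloc[:, 1:].to_numpy(), split)
    # pre-allocate zeros, then fill the leading rows of each block
    out = np.zeros((len(parts), n_rows, df.shape[1] - 1))
    for i, part in enumerate(parts):
        out[i, : len(part)] = part
    return out
```

This assumes each document's rows are contiguous and the frame has a default RangeIndex, as in the example.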
