Take everything below as an example.
I have the following pandas dataframe:
name col1 col2 col3 col4 col5 col6 col7
---- ---- ---- ---- ---- ---- ---- ----
doc1 100 1 1 1 1 1 1
doc1 200 2 2 2 2 2 2
doc1 300 3 3 3 3 3 3
doc2 100 11 11 11 11 11 11
doc2 200 21 21 21 21 21 21
doc2 300 31 31 31 31 31 31
doc2 300 31 31 31 31 31 31
doc3 100 12 12 12 12 12 12
doc3 100 12 12 12 12 12 12
doc3 200 22 22 22 22 22 22
doc3 300 32 32 32 32 32 32
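For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is only a sketch: the variable name df is assumed (it is the name the answers use), and the helper list rows is introduced here purely for illustration.

```python
import pandas as pd

# (name, col1, value shared by col2..col7) for each row of the example table
rows = [
    ("doc1", 100, 1), ("doc1", 200, 2), ("doc1", 300, 3),
    ("doc2", 100, 11), ("doc2", 200, 21), ("doc2", 300, 31), ("doc2", 300, 31),
    ("doc3", 100, 12), ("doc3", 100, 12), ("doc3", 200, 22), ("doc3", 300, 32),
]

# col2..col7 hold the same value in every row of the example, so repeat it six times
df = pd.DataFrame(
    [(name, col1, *([value] * 6)) for name, col1, value in rows],
    columns=["name"] + [f"col{i}" for i in range(1, 8)],
)
print(df.shape)  # (11, 8)
```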
The name column is the one that should be used to aggregate the data.
Now, I need to convert the data in all the colX columns of a given docX into an array, and then end up with an array of arrays.
Each individual array must have 5 rows, so any document that does not have 5 rows should be padded with 0's.
Then in the example above, I would expect to get the following:
data = [
[
[100, 1, 1, 1, 1, 1, 1],
[200, 2, 2, 2, 2, 2, 2],
[300, 3, 3, 3, 3, 3, 3],
[ 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0]
],
[
[100, 11, 11, 11, 11, 11, 11],
[200, 21, 21, 21, 21, 21, 21],
[300, 31, 31, 31, 31, 31, 31],
[300, 31, 31, 31, 31, 31, 31],
[ 0, 0, 0, 0, 0, 0, 0]
],
[
[100, 12, 12, 12, 12, 12, 12],
[100, 12, 12, 12, 12, 12, 12],
[200, 22, 22, 22, 22, 22, 22],
[300, 32, 32, 32, 32, 32, 32],
[ 0, 0, 0, 0, 0, 0, 0]
]
]
data.shape == (3, 5, 7)
How can I do it in a smart way?
I'm not sure about smart, but you can try pivot_table with reindex:
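import numpy as np

# number the rows within each document, then pivot so every document has rows 0..4
tmp = (df.assign(row=df.groupby('name').cumcount())
         .pivot_table(index=['row'], columns=['name'], fill_value=0)
         .reindex(np.arange(5), fill_value=0).T
         .unstack(level=0).to_numpy()
      )
# one (5, n_columns) block per document
out = tmp.reshape(df['name'].nunique(), 5, -1)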
Output:
array([[[100, 1, 1, 1, 1, 1, 1],
[200, 2, 2, 2, 2, 2, 2],
[300, 3, 3, 3, 3, 3, 3],
[ 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0]],
[[100, 11, 11, 11, 11, 11, 11],
[200, 21, 21, 21, 21, 21, 21],
[300, 31, 31, 31, 31, 31, 31],
[300, 31, 31, 31, 31, 31, 31],
[ 0, 0, 0, 0, 0, 0, 0]],
[[100, 12, 12, 12, 12, 12, 12],
[100, 12, 12, 12, 12, 12, 12],
[200, 22, 22, 22, 22, 22, 22],
[300, 32, 32, 32, 32, 32, 32],
[ 0, 0, 0, 0, 0, 0, 0]]])
Get the positions in the name column where the name changes (from doc1 to doc2 to doc3). These will be used to split the dataframe:
import pandas as pd
import numpy as np
split = df.index[~df.name.eq(df.name.shift())][1:]
split
Int64Index([3, 7], dtype='int64')
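To see why the shift comparison finds the boundaries, here is a small sketch on a toy Series (the names toy and starts are made up for illustration). It assumes, like the step above, that rows of the same document are contiguous and that the index is the default 0..n-1 range, since numpy.split expects positions.

```python
import pandas as pd

toy = pd.Series(["doc1", "doc1", "doc2", "doc2", "doc2", "doc3"])

# a row starts a new run when its name differs from the previous row's name;
# the first row is compared with NaN, so it always counts as a start
starts = ~toy.eq(toy.shift())
print(starts.tolist())  # [True, False, True, False, False, True]

# drop the first start (position 0) because np.split only needs the inner cut points
cut_points = toy.index[starts][1:]
print(cut_points.tolist())  # [2, 5]
```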
Split the dataframe at the positions in split, using numpy.split:
df_split = np.split(df.iloc[:, 1:].to_numpy(), split)
df_split
[array([[100, 1, 1, 1, 1, 1, 1],
[200, 2, 2, 2, 2, 2, 2],
[300, 3, 3, 3, 3, 3, 3]]),
array([[100, 11, 11, 11, 11, 11, 11],
[200, 21, 21, 21, 21, 21, 21],
[300, 31, 31, 31, 31, 31, 31],
[300, 31, 31, 31, 31, 31, 31]]),
array([[100, 12, 12, 12, 12, 12, 12],
[100, 12, 12, 12, 12, 12, 12],
[200, 22, 22, 22, 22, 22, 22],
[300, 32, 32, 32, 32, 32, 32]])]
Get the lengths of the individual arrays; value_counts or a list comprehension works fine:
split = [len(arr) for arr in df_split]
split
[3, 4, 4]
Create arrays of zeros:
zeros = np.zeros((3, 5, 7))
Finally, fill the zeros with the values from df_split:
zeros[0, : split[0]] = df_split[0]
zeros[1, : split[1]] = df_split[1]
zeros[2, : split[2]] = df_split[2]
zeros
array([[[100., 1., 1., 1., 1., 1., 1.],
[200., 2., 2., 2., 2., 2., 2.],
[300., 3., 3., 3., 3., 3., 3.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0.]],
[[100., 11., 11., 11., 11., 11., 11.],
[200., 21., 21., 21., 21., 21., 21.],
[300., 31., 31., 31., 31., 31., 31.],
[300., 31., 31., 31., 31., 31., 31.],
[ 0., 0., 0., 0., 0., 0., 0.]],
[[100., 12., 12., 12., 12., 12., 12.],
[100., 12., 12., 12., 12., 12., 12.],
[200., 22., 22., 22., 22., 22., 22.],
[300., 32., 32., 32., 32., 32., 32.],
[ 0., 0., 0., 0., 0., 0., 0.]]])
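The steps above can probably be wrapped in a loop so the number of documents and columns is not hard-coded. The following is a rough sketch, not a tested drop-in: the function name pad_groups and its parameters key and n_rows are invented for this example, and it swaps the shift-based split for groupby(sort=False), which groups rows by name even when they are not contiguous. It also keeps the integer dtype instead of the float zeros used above.

```python
import numpy as np
import pandas as pd

def pad_groups(df, key="name", n_rows=5):
    """Return an array of shape (n_groups, n_rows, n_value_cols), zero-padded per group."""
    value_cols = [c for c in df.columns if c != key]
    # sort=False keeps the groups in order of first appearance
    groups = [g[value_cols].to_numpy() for _, g in df.groupby(key, sort=False)]
    dtype = df[value_cols].to_numpy().dtype
    out = np.zeros((len(groups), n_rows, len(value_cols)), dtype=dtype)
    for i, arr in enumerate(groups):
        if len(arr) > n_rows:
            raise ValueError(f"group {i} has {len(arr)} rows, more than n_rows={n_rows}")
        out[i, :len(arr)] = arr  # remaining rows stay zero
    return out

# tiny demo frame (made up), padded to 3 rows per group
demo = pd.DataFrame({"name": ["a", "a", "b"], "x": [1, 2, 3], "y": [4, 5, 6]})
result = pad_groups(demo, n_rows=3)
print(result.shape)  # (2, 3, 2)
```

With the dataframe from the question, pad_groups(df) should give an array of shape (3, 5, 7).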