
Transpose multiple variables in rows to columns depending on a groupby using pandas

This refers to a question that was answered before using SAS: SAS - transpose multiple variables in rows to columns

The new twist is that the number of rows per group is not fixed at two but varies. Here is an example:

   acct     la   ln  seq1  seq2
0  9999  20.01  100     1    10
1  9999  19.05    1     1    10
2  9999  30.00    1     1    10
3  9999  26.77  100     2    11
4  9999  24.96    1     2    11
5  8888  38.43  218     3    20
6  8888  37.53    1     3    20

My desired output is:

   acct     la   ln  seq1  seq2    la0    la1  la2  la3  ln0  ln1  ln2
5  8888  38.43  218     3    20  38.43  37.53  NaN  NaN  218    1  NaN
0  9999  20.01  100     1    10  20.01  19.05   30  NaN  100    1    1
3  9999  26.77  100     2    11  26.77  24.96  NaN  NaN  100    1  NaN

In SAS I could use proc summary, which is fairly simple; however, I want to get it done in Python since I can no longer use SAS.

I already solved the question and can reuse the code for my problems, but I was wondering whether there is an easier option in pandas that I missed. Here is my solution. It would be interesting if someone has a faster approach!

# transpose multiple rows to columns based on a groupby

import pandas as pd
from pandas import DataFrame
import numpy as np 

data = DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1]
    })

# group the variables by some classes
grouped = data.groupby(["acct", "seq1", "seq2"])

def rows_to_col(column, size):
    # create head and contain to iterate through the groupby values
    head = []
    contain = []
    for key, group in grouped:
        head.append(key)
        contain.append(group.copy())  # copy so the column assignments below don't modify views

    # transpose the values in contain
    contain_transpose = []
    for i in range(0,len(contain)):
        contain_transpose.append(contain[i][column].tolist())

    # determine the length of the longest sublist
    length = len(max(contain_transpose, key=len))
    # pad sublists shorter than the longest with missing values
    for i in range(0, len(contain_transpose)):
        if len(contain_transpose[i]) != length:
            contain_transpose[i].extend([np.nan] * (length - len(contain_transpose[i])))

    # create columns for the transposed column values
    for i in range(0, len(contain)):
        for j in range(0, size):
            contain[i][column + str(j)] = np.nan

    # assign the transposed values to the column
    for i in range(0, len(contain)):
        for j in range(0, length):
            contain[i][column + str(j)] = contain_transpose[i][j]

    # now always take the first values of the grouped group
    concat_list = []

    for i in range(0, len(contain)):
        concat_list.append(contain[i][:1])

    return pd.concat(concat_list) # concate the list

# fill in column name and expected size of the column
data_la = rows_to_col("la", 4)
data_ln = rows_to_col("ln", 3)

# merge the two data frames together
cols_use = data_ln.columns.difference(data_la.columns)

data_final = pd.merge(data_la, data_ln[cols_use], left_index=True, right_index=True, how="outer")
data_final = data_final.drop(["la", "ln"], axis=1)
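For comparison, here is a more compact sketch (not the code above, and the column names `pos`, `wide`, `first`, and `result` are my own) that numbers the rows within each group with `groupby.cumcount` and spreads them into columns with `unstack`:

```python
import pandas as pd

data = pd.DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1],
})

keys = ["acct", "seq1", "seq2"]
# number the rows within each group: 0, 1, 2, ...
data["pos"] = data.groupby(keys).cumcount()
# pivot the within-group position into columns for both variables at once
wide = data.set_index(keys + ["pos"])[["la", "ln"]].unstack("pos")
# flatten the MultiIndex columns: ('la', 0) -> 'la0'
wide.columns = [name + str(pos) for name, pos in wide.columns]
# attach the wide columns to one representative row per group
first = data.drop(columns="pos").groupby(keys, as_index=False).first()
result = first.merge(wide, left_on=keys, right_index=True)
```

Unlike the function above, this version does not need a `size` argument; the number of generated columns follows the largest group automatically.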

Note that:

In [58]:

print(grouped.la.apply(lambda x: pd.Series(data=x.values)).unstack())
                    0      1   2
acct seq1 seq2                  
8888 3    20    38.43  37.53 NaN
9999 1    10    20.01  19.05  30
     2    11    26.77  24.96 NaN

and:

In [59]:

print(grouped.ln.apply(lambda x: pd.Series(data=x.values)).unstack())
                  0  1   2
acct seq1 seq2            
8888 3    20    218  1 NaN
9999 1    10    100  1   1
     2    11    100  1 NaN

Therefore:

In [60]:

df2 = pd.concat((grouped.la.apply(lambda x: pd.Series(data=x.values)).unstack(),
                 grouped.ln.apply(lambda x: pd.Series(data=x.values)).unstack()),
                keys= ['la', 'ln'], axis=1)
print(df2)
                   la              ln       
                    0      1   2    0  1   2
acct seq1 seq2                              
8888 3    20    38.43  37.53 NaN  218  1 NaN
9999 1    10    20.01  19.05  30  100  1   1
     2    11    26.77  24.96 NaN  100  1 NaN

The only problem is that the column index is a MultiIndex. If we don't want that, we can flatten it to la0, la1, ... by:

df2.columns = [name + str(pos) for name, pos in df2.columns]
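Putting the answer's approach together as one runnable Python 3 snippet (a sketch of the steps shown above, with the flattened column names):

```python
import pandas as pd

data = pd.DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1],
})
grouped = data.groupby(["acct", "seq1", "seq2"])

# spread each group's values across numbered columns, per variable
df2 = pd.concat(
    (grouped["la"].apply(lambda x: pd.Series(x.values)).unstack(),
     grouped["ln"].apply(lambda x: pd.Series(x.values)).unstack()),
    keys=["la", "ln"], axis=1)

# flatten the MultiIndex columns: ('la', 0) -> 'la0'
df2.columns = [name + str(pos) for name, pos in df2.columns]
```

Groups shorter than the longest one are padded with NaN automatically by `unstack`, so no explicit padding step is needed.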

I don't know what you think, but I prefer the SAS PROC TRANSPOSE syntax for readability. The pandas syntax is concise but less readable in this particular case.
