
Transpose multiple variables in rows to columns depending on a groupby using pandas

This refers to a question that was answered before using SAS: SAS - transpose multiple variables in rows to columns

The new twist is that the number of rows per group is not fixed at two but varies. Here is an example:

   acct     la   ln  seq1  seq2
0  9999  20.01  100     1    10
1  9999  19.05    1     1    10
2  9999  30.00    1     1    10
3  9999  26.77  100     2    11
4  9999  24.96    1     2    11
5  8888  38.43  218     3    20
6  8888  37.53    1     3    20

My desired output is:

   acct     la   ln  seq1  seq2    la0    la1  la2  la3  ln0  ln1  ln2
5  8888  38.43  218     3    20  38.43  37.53  NaN  NaN  218    1  NaN
0  9999  20.01  100     1    10  20.01  19.05   30  NaN  100    1    1
3  9999  26.77  100     2    11  26.77  24.96  NaN  NaN  100    1  NaN

In SAS I could use proc summary, which is fairly simple; however, I want to get it done in Python since I can no longer use SAS.

I already solved the question and can reuse the code for my problems, but I was wondering whether there is an easier option in pandas that I missed. Here is my solution. It would be interesting if someone has a faster approach!

# transpose multiple rows to columns based on a groupby

import pandas as pd
from pandas import DataFrame
import numpy as np 

data = DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1]
    })

# group the variables by some classes
grouped = data.groupby(["acct", "seq1", "seq2"])

def rows_to_col(column, size):
    # create head and contain to iterate through the groupby values
    head = []
    contain = []
    for key, group in grouped:
        head.append(key)
        contain.append(group.copy())  # copy so the column assignments below don't modify views

    # transpose the values in contain
    contain_transpose = []
    for i in range(0,len(contain)):
        contain_transpose.append(contain[i][column].tolist())

    # determine the length of the longest sublist
    length = len(max(contain_transpose, key=len))
    # pad sublists shorter than the longest with missing values
    for i in range(0, len(contain_transpose)):
        if len(contain_transpose[i]) != length:
            contain_transpose[i].extend([np.nan] * (length - len(contain_transpose[i])))

    # create columns for the transposed column values
    for i in range(0, len(contain)):
        for j in range(0, size):
            contain[i][column + str(j)] = np.nan

    # assign the transposed values to the column
    for i in range(0, len(contain)):
        for j in range(0, length):
            contain[i][column + str(j)] = contain_transpose[i][j]

    # now always take the first values of the grouped group
    concat_list = []

    for i in range(0, len(contain)):
        concat_list.append(contain[i][:1])

    return pd.concat(concat_list) # concate the list

# fill in column name and expected size of the column
data_la = rows_to_col("la", 4)
data_ln = rows_to_col("ln", 3)

# merge the two data frames together
cols_use = data_ln.columns.difference(data_la.columns)

data_final = pd.merge(data_la, data_ln[cols_use], left_index=True, right_index=True, how="outer")
data_final = data_final.drop(["la", "ln"], axis=1)
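For comparison, here is a more compact sketch (not the code above, and the column names `pos`, `wide`, `first`, and `result` are my own) that numbers the rows within each group with `groupby.cumcount` and spreads them into columns with `unstack`:

```python
import pandas as pd

data = pd.DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1],
})

keys = ["acct", "seq1", "seq2"]
# number the rows within each group: 0, 1, 2, ...
data["pos"] = data.groupby(keys).cumcount()
# pivot the within-group position into columns for both variables at once
wide = data.set_index(keys + ["pos"])[["la", "ln"]].unstack("pos")
# flatten the MultiIndex columns: ('la', 0) -> 'la0'
wide.columns = [name + str(pos) for name, pos in wide.columns]
# attach the wide columns to one representative row per group
first = data.drop(columns="pos").groupby(keys, as_index=False).first()
result = first.merge(wide, left_on=keys, right_index=True)
```

Unlike the function above, this version does not need a `size` argument; the number of generated columns follows the largest group automatically.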

Note that:

In [58]:

print(grouped.la.apply(lambda x: pd.Series(data=x.values)).unstack())
                    0      1   2
acct seq1 seq2                  
8888 3    20    38.43  37.53 NaN
9999 1    10    20.01  19.05  30
     2    11    26.77  24.96 NaN

and:

In [59]:

print(grouped.ln.apply(lambda x: pd.Series(data=x.values)).unstack())
                  0  1   2
acct seq1 seq2            
8888 3    20    218  1 NaN
9999 1    10    100  1   1
     2    11    100  1 NaN

Therefore:

In [60]:

df2 = pd.concat((grouped.la.apply(lambda x: pd.Series(data=x.values)).unstack(),
                 grouped.ln.apply(lambda x: pd.Series(data=x.values)).unstack()),
                keys= ['la', 'ln'], axis=1)
print(df2)
                   la              ln       
                    0      1   2    0  1   2
acct seq1 seq2                              
8888 3    20    38.43  37.53 NaN  218  1 NaN
9999 1    10    20.01  19.05  30  100  1   1
     2    11    26.77  24.96 NaN  100  1 NaN

The only problem is that the column index is a MultiIndex. If we don't want that, we can flatten it to la0, la1, ... by:

df2.columns = [name + str(pos) for name, pos in df2.columns]
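Putting the answer's approach together as one runnable Python 3 snippet (a sketch of the steps shown above, with the flattened column names):

```python
import pandas as pd

data = pd.DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1],
})
grouped = data.groupby(["acct", "seq1", "seq2"])

# spread each group's values across numbered columns, per variable
df2 = pd.concat(
    (grouped["la"].apply(lambda x: pd.Series(x.values)).unstack(),
     grouped["ln"].apply(lambda x: pd.Series(x.values)).unstack()),
    keys=["la", "ln"], axis=1)

# flatten the MultiIndex columns: ('la', 0) -> 'la0'
df2.columns = [name + str(pos) for name, pos in df2.columns]
```

Groups shorter than the longest one are padded with NaN automatically by `unstack`, so no explicit padding step is needed.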

I don't know what you think, but I prefer the SAS PROC TRANSPOSE syntax for readability. The pandas syntax is concise but less readable in this particular case.
