This refers to a question answered before for SAS: SAS - transpose multiple variables in rows to columns.
The new twist is that the number of rows per group is not fixed at two but varies. Here is an example:
   acct     la   ln  seq1  seq2
0  9999  20.01  100     1    10
1  9999  19.05    1     1    10
2  9999  30.00    1     1    10
3  9999  26.77  100     2    11
4  9999  24.96    1     2    11
5  8888  38.43  218     3    20
6  8888  37.53    1     3    20
My desired output is:
   acct     la   ln  seq1  seq2    la0    la1  la2  la3  ln0  ln1  ln2
5  8888  38.43  218     3    20  38.43  37.53  NaN  NaN  218    1  NaN
0  9999  20.01  100     1    10  20.01  19.05   30  NaN  100    1    1
3  9999  26.77  100     2    11  26.77  24.96  NaN  NaN  100    1  NaN
In SAS I could use proc summary, which is fairly simple, but I want to get this done in Python since I can no longer use SAS.
I already solved the question in a way I can reuse for my problems, but I was wondering whether there is an easier option in pandas that I missed. Here is my solution. It would be interesting if someone has a faster approach!
# write multiple rows to columns based on a groupby
import pandas as pd
from pandas import DataFrame
import numpy as np

data = DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1]
})

# group the variables by some classes
grouped = data.groupby(["acct", "seq1", "seq2"])

def rows_to_col(column, size):
    # collect the group keys and the group frames
    head = []
    contain = []
    for i, j in grouped:
        head.append(i)
        contain.append(j.copy())  # copy so we can add columns safely
    # transpose the values in contain
    contain_transpose = []
    for i in range(len(contain)):
        contain_transpose.append(contain[i][column].tolist())
    # determine the longest sublist
    length = len(max(contain_transpose, key=len))
    # pad sublists shorter than the longest one with missing values
    # (note: np.nan values, not the string "NaN")
    for i in range(len(contain_transpose)):
        if len(contain_transpose[i]) != length:
            contain_transpose[i].extend(
                [np.nan] * (length - len(contain_transpose[i])))
    # create columns for the transposed column values
    for i in range(len(contain)):
        for j in range(size):
            contain[i][column + str(j)] = np.nan
    # assign the transposed values to the new columns
    for i in range(len(contain)):
        for j in range(length):
            contain[i][column + str(j)] = contain_transpose[i][j]
    # now always take the first row of each group
    concat_list = []
    for i in range(len(contain)):
        concat_list.append(contain[i][:1])
    return pd.concat(concat_list)  # concatenate the list

# fill in column name and expected size of the column
data_la = rows_to_col("la", 4)
data_ln = rows_to_col("ln", 3)
# merge the two data frames together
cols_use = data_ln.columns.difference(data_la.columns)
data_final = pd.merge(data_la, data_ln[cols_use],
                      left_index=True, right_index=True, how="outer")
data_final = data_final.drop(["la", "ln"], axis=1)  # drop returns a new frame
Note that:
In [58]:
print(grouped.la.apply(lambda x: pd.Series(data=x.values)).unstack())
                    0      1    2
acct seq1 seq2
8888 3    20    38.43  37.53  NaN
9999 1    10    20.01  19.05   30
     2    11    26.77  24.96  NaN
and:
In [59]:
print(grouped.ln.apply(lambda x: pd.Series(data=x.values)).unstack())
                  0  1   2
acct seq1 seq2
8888 3    20    218  1 NaN
9999 1    10    100  1   1
     2    11    100  1 NaN
Therefore:
In [60]:
df2 = pd.concat((grouped.la.apply(lambda x: pd.Series(data=x.values)).unstack(),
                 grouped.ln.apply(lambda x: pd.Series(data=x.values)).unstack()),
                keys=['la', 'ln'], axis=1)
print(df2)
                   la                 ln
                    0      1    2      0  1   2
acct seq1 seq2
8888 3    20    38.43  37.53  NaN    218  1 NaN
9999 1    10    20.01  19.05   30    100  1   1
     2    11    26.77  24.96  NaN    100  1 NaN
The only problem is that the column index is a MultiIndex. If we don't want that, we can flatten it to la0, la1, ..., ln0, ... with:
df2.columns = [name + str(pos) for name, pos in df2.columns]
I don't know what you think, but I prefer the SAS PROC TRANSPOSE syntax for readability. The pandas syntax is concise but less readable in this particular case.
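As a further variant (my own sketch, not from the exchange above), groupby.cumcount() plus unstack comes close to a one-pass PROC TRANSPOSE. Note it only creates as many columns as the largest group actually needs, so there is no empty la3 column here:

```python
# Alternative sketch: number rows within each group with cumcount(),
# move that position into the index, and unstack it into columns.
import pandas as pd

data = pd.DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1],
})

keys = ["acct", "seq1", "seq2"]
pos = data.groupby(keys).cumcount().rename("pos")  # 0, 1, 2 within each group
wide = data.set_index(keys + [pos])[["la", "ln"]].unstack("pos")
# flatten the (variable, position) MultiIndex into la0, la1, ...
wide.columns = [name + str(p) for name, p in wide.columns]
print(wide)
```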