
Transpose multiple variables in rows to columns depending on a groupby using pandas

This question has been answered before for SAS: SAS - transpose multiple variables in rows to columns

What is new here is that the number of values per variable is not fixed at two but varies. Here is an example:

   acct     la   ln  seq1  seq2
0  9999  20.01  100     1    10
1  9999  19.05    1     1    10
2  9999  30.00    1     1    10
3  9999  26.77  100     2    11
4  9999  24.96    1     2    11
5  8888  38.43  218     3    20
6  8888  37.53    1     3    20

The output I want is:

   acct     la   ln  seq1  seq2    la0    la1  la2  la3  ln0  ln1  ln2
5  8888  38.43  218     3    20  38.43  37.53  NaN  NaN  218    1  NaN
0  9999  20.01  100     1    10  20.01  19.05   30  NaN  100    1    1
3  9999  26.77  100     2    11  26.77  24.96  NaN  NaN  100    1  NaN

In SAS I could use a fairly simple proc summary, but I want to do this in Python because I can no longer use SAS.

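I have already written a solution that I can reuse for my problem, but I would like to know whether pandas has a simpler option that I have not seen. Here is my solution. If someone has a faster way, I would be interested to see it!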

# write multiple rows to columns based on a groupby

import pandas as pd
from pandas import DataFrame
import numpy as np 

data = DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1]
    })

# group the variables by some classes
grouped = data.groupby(["acct", "seq1", "seq2"])

def rows_to_col(column, size):
    # collect the group keys (head) and a copy of each group's rows (contain)
    head = []
    contain = []
    for key, group in grouped:
        head.append(key)
        contain.append(group.copy())  # copy so the new columns are not written into a slice of data

    # transpose the values in contain
    contain_transpose = []
    for i in range(0,len(contain)):
        contain_transpose.append(contain[i][column].tolist())

    # determine the length of the longest sublist
    length = len(max(contain_transpose, key = len))
    # pad every shorter sublist with NaN up to that length
    for i in range(0, len(contain_transpose)):
        if len(contain_transpose[i]) != length:
            contain_transpose[i].extend([np.nan] * (length - len(contain_transpose[i])))

    # create columns for the transposed column values
    for i in range(0, len(contain)):
        for j in range(0, size):
            contain[i][column + str(j)] = np.nan

    # assign the transposed values to the column
    for i in range(0, len(contain)):
        for j in range(0, length):
            contain[i][column + str(j)] = contain_transpose[i][j]

    # now always take the first values of the grouped group
    concat_list = []

    for i in range(0, len(contain)):
        concat_list.append(contain[i][:1])

    return pd.concat(concat_list) # concatenate the list

# fill in column name and expected size of the column
data_la = rows_to_col("la", 4)
data_ln = rows_to_col("ln", 3)

# merge the two data frames together
cols_use = data_ln.columns.difference(data_la.columns)

data_final = pd.merge(data_la, data_ln[cols_use], left_index=True, right_index=True, how="outer")
# drop() returns a new frame, so keep the result
data_final = data_final.drop(["la", "ln"], axis = 1)

Note:

In [58]:

print(grouped.la.apply(lambda x: pd.Series(data=x.values)).unstack())
                    0      1   2
acct seq1 seq2                  
8888 3    20    38.43  37.53 NaN
9999 1    10    20.01  19.05  30
     2    11    26.77  24.96 NaN

And:

In [59]:

print(grouped.ln.apply(lambda x: pd.Series(data=x.values)).unstack())
                  0  1   2
acct seq1 seq2            
8888 3    20    218  1 NaN
9999 1    10    100  1   1
     2    11    100  1 NaN

So:

In [60]:

df2 = pd.concat((grouped.la.apply(lambda x: pd.Series(data=x.values)).unstack(),
                 grouped.ln.apply(lambda x: pd.Series(data=x.values)).unstack()),
                keys= ['la', 'ln'], axis=1)
print(df2)
                   la              ln       
                    0      1   2    0  1   2
acct seq1 seq2                              
8888 3    20    38.43  37.53 NaN  218  1 NaN
9999 1    10    20.01  19.05  30  100  1   1
     2    11    26.77  24.96 NaN  100  1 NaN
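
A table of the same shape can also be built without `apply`. What follows is only a sketch of an alternative, not the approach shown above: it numbers the rows inside each group with `groupby(...).cumcount()` and then unstacks that position level. The names `keys`, `pos` and `wide` are made up for this illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1],
})
keys = ["acct", "seq1", "seq2"]

# Position of each row inside its group: 0, 1, 2, ...
# ("pos" is an illustrative name, not something pandas requires)
pos = data.groupby(keys).cumcount()

# Move the keys and the position into the index, then turn the position
# level into columns; groups with fewer rows are padded with NaN.
wide = (
    data.assign(pos=pos)
        .set_index(keys + ["pos"])[["la", "ln"]]
        .unstack("pos")
)
print(wide)
```

The result again has one row per `(acct, seq1, seq2)` group and two-level columns such as `('la', 0)` and `('ln', 2)`.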

The only issue is that the column index is a MultiIndex. If we don't want that, we can convert the labels to la0, la1, ... with:

df2.columns = [name + str(pos) for name, pos in df2.columns]
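
To get from the flattened columns to the layout asked for in the question (first row of each group followed by the spread-out values), one possible sketch is below. It assumes a `cumcount`-plus-`unstack` reshaping rather than the `apply` calls above, and the names `keys`, `wide`, `first` and `result` are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1],
})
keys = ["acct", "seq1", "seq2"]

# One row per group, with (variable, position) two-level columns
wide = (
    data.assign(pos=data.groupby(keys).cumcount())
        .set_index(keys + ["pos"])[["la", "ln"]]
        .unstack("pos")
)

# Flatten ('la', 0) -> 'la0', ('ln', 2) -> 'ln2', ...
wide.columns = [f"{name}{pos}" for name, pos in wide.columns]

# First row of every group; the original row labels (0, 3, 5) are kept
first = data.drop_duplicates(subset=keys)

# Attach the wide columns by matching the key columns to wide's index levels
result = first.join(wide, on=keys).sort_values(keys)
print(result)
```

Unlike the desired output in the question, this sketch only creates as many columns as the longest group needs, so there is no all-NaN `la3` column; if a fixed width is required, the columns could be reindexed afterwards.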

I don't know how you see it, but I prefer the SAS PROC TRANSPOSE syntax for its readability. In this particular case the pandas syntax is concise but less readable.
