
Groupby and transpose in pandas, python

The DataFrame I have:

ID  col  col2   col3   col4

1   A    50      S      1
1   A    52      M      4
1   B    45      N      8
1   C    18      S      7

The DataFrame I want:

ID  col  colA   colB   colC   colD   colE   colF

1   A    50     52      S      M       1      4
1   B    45     NULL    N     NULL     8     NULL
1   C    18     NULL    S     NULL     7     NULL

I want one row per unique ID+col (group by ID and col). If there are multiple entries per ID+col (at most 2), put the first value of col2 in colA and the second in colB, the first value of col3 in colC and the second in colD, and the first value of col4 in colE and the second in colF. If there is only one entry per ID+col, put the col2 value in colA and leave colB null, and likewise for the other columns.

I tried to first create a counter:

df['COUNT'] = df.groupby(['ID','col']).cumcount()+1

From here I was thinking of just adding columns along these lines:

if count=1 then df['colA']=df.col2
if count=2 then df['colB']=df.col2

...but this would still leave the same number of rows as the original df.

I think you need set_index with unstack:

df['COUNT'] = df.groupby(['ID','col']).cumcount()+1

df = df.set_index(['ID','col', 'COUNT'])['col2'].unstack().add_prefix('col').reset_index()
print (df)
COUNT  ID col  col1  col2
0       1   A  50.0  52.0
1       1   B  45.0   NaN
2       1   C  18.0   NaN

Or, without storing the counter as a column:

c = df.groupby(['ID','col']).cumcount()+1

df = df.set_index(['ID','col', c])['col2'].unstack().add_prefix('col').reset_index()
print (df)
   ID col  col1  col2
0   1   A  50.0  52.0
1   1   B  45.0   NaN
2   1   C  18.0   NaN
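
For reference, here is a minimal self-contained sketch of the same idea, with the sample data rebuilt from the question. The names `counter` and `wide` are only illustrative and do not appear in the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1],
    "col": ["A", "A", "B", "C"],
    "col2": [50, 52, 45, 18],
    "col3": ["S", "M", "N", "S"],
    "col4": [1, 4, 8, 7],
})

# Number the rows inside each (ID, col) group: 1 for the first row, 2 for the second.
counter = df.groupby(["ID", "col"]).cumcount() + 1

# Put the counter in the index, then unstack() turns its values into columns.
wide = (
    df.set_index(["ID", "col", counter])["col2"]
      .unstack()
      .add_prefix("col")
      .reset_index()
)
print(wide)
```

Groups with a single row get NaN in the second column, which is also why the values are shown as floats.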

EDIT:

For multiple columns the solution changes a bit, because unstacking then produces a MultiIndex in the columns:

df['COUNT'] = (df.groupby(['ID','col']).cumcount()+1).astype(str)

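# No ['col2'] selection this time, so every remaining column is unstacked
df = df.set_index(['ID','col', 'COUNT']).unstack()
# Flatten the column MultiIndex: ('col2', '1') -> 'col2_1'
# (COUNT was cast to str above so that '_'.join works)
df.columns = df.columns.map('_'.join)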
df = df.reset_index()
print (df)
   ID col  col2_1  col2_2 col3_1 col3_2  col4_1  col4_2
0   1   A    50.0    52.0      S      M     1.0     4.0
1   1   B    45.0     NaN      N   None     8.0     NaN
2   1   C    18.0     NaN      S   None     7.0     NaN
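
To get the exact column names asked for in the question (colA to colF), the flattened names can be renamed afterwards. The sketch below repeats the steps above on the sample data; `rename_map` is a hypothetical mapping written out by hand for this example, not something pandas derives for you.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1],
    "col": ["A", "A", "B", "C"],
    "col2": [50, 52, 45, 18],
    "col3": ["S", "M", "N", "S"],
    "col4": [1, 4, 8, 7],
})

# Per-group counter as a string so it can be joined into column names.
df["COUNT"] = (df.groupby(["ID", "col"]).cumcount() + 1).astype(str)

# Unstack all value columns at once, then flatten ('col2', '1') -> 'col2_1'.
wide = df.set_index(["ID", "col", "COUNT"]).unstack()
wide.columns = wide.columns.map("_".join)
wide = wide.reset_index()

# Hypothetical hand-written mapping to the names used in the question.
rename_map = {
    "col2_1": "colA", "col2_2": "colB",
    "col3_1": "colC", "col3_2": "colD",
    "col4_1": "colE", "col4_2": "colF",
}
wide = wide.rename(columns=rename_map)
print(wide)
```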

You can use groupby with apply(list) followed by apply(pd.Series):

df.groupby(['ID','col']).col2.apply(list).apply(pd.Series).add_prefix('col').reset_index()
Out[404]: 
   ID col  col0  col1
0   1   A  50.0  52.0
1   1   B  45.0   NaN
2   1   C  18.0   NaN
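
It may help to look at the intermediate step: `apply(list)` collects each group's values into a list, and the lists are then spread across columns. The sketch below is a variation under my own assumptions, not the answer's exact code: it builds the wide frame with the `pd.DataFrame` constructor instead of `apply(pd.Series)`, and the names `lists` and `wide` are illustrative.

```python
import pandas as pd

# Sample data copied from the question (only the columns used here).
df = pd.DataFrame({
    "ID": [1, 1, 1, 1],
    "col": ["A", "A", "B", "C"],
    "col2": [50, 52, 45, 18],
})

# One list of col2 values per (ID, col) group.
lists = df.groupby(["ID", "col"])["col2"].apply(list)
print(lists)

# Spread the ragged lists over columns 0, 1, ...; shorter lists are padded with NaN.
wide = (
    pd.DataFrame(lists.tolist(), index=lists.index)
      .add_prefix("col")
      .reset_index()
)
print(wide)
```

Note that the new columns are numbered from 0 here (col0, col1), unlike the cumcount()+1 approach above.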

Not sure if this is what you are looking for, but it produces the result you describe. Note that I am using multiple aggregate functions on the same column, which gives MultiIndex columns, so I use ravel to flatten the DataFrame columns.

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID':[1,1,1,1], 
                  'Col1':['A','A','B','C'],
                 'Col2':[50,52,45,18]})

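# Take the first and last Col2 value of each (ID, Col1) group
df = df.groupby(['ID','Col1']).agg({'Col2':['first','last']})
# Flatten ('Col2', 'first') -> 'Col2_first'
df.columns = ["_".join(x) for x in df.columns.ravel()]
df = df.reset_index()
# For single-row groups first == last, so blank out the duplicate.
# Note: this also blanks a real second value that happens to equal the first.
df['Col2_last'] = np.where(df.Col2_first == df.Col2_last, float('nan'), df.Col2_last)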

print(df)
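
One caveat with the equality check above: a group whose two values are identical would lose its second value. A possible way around this, sketched here as an assumption rather than as part of the original answer, is to decide based on the group's row count instead. The sample data is changed so that group A holds 50 twice, and `out` is an illustrative name.

```python
import pandas as pd

# Sample data modified so that group A has two identical values.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1],
    "Col1": ["A", "A", "B", "C"],
    "Col2": [50, 50, 45, 18],
})

# Aggregate first, last and the number of rows per group.
out = (
    df.groupby(["ID", "Col1"])["Col2"]
      .agg(["first", "last", "count"])
      .add_prefix("Col2_")
      .reset_index()
)

# Keep "last" only where the group really has a second row; otherwise NaN.
out["Col2_last"] = out["Col2_last"].where(out["Col2_count"] > 1)
out = out.drop(columns="Col2_count")
print(out)
```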
