简体   繁体   中英

Pandas: Sort before aggregate within a group

I have the following Pandas dataframe:

A        B         C
A        A         Test1
A        A         Test2
A        A         XYZ
A        B         BA
A        B         AB
B        A         AA

I want to group this dataset twice: First by A and B to concate the group within C and afterwards only on A to get the groups defined solely by column A. The result looks like this:

A        A        Test1,Test2,XYZ
A        B        AB, BA
B        A        AA

And the final result should be:

A        A,A:(Test1,Test2,XYZ), A,B:(AB, BA)
B        B,A:(AA)

Concatenating itself works, however the sorting does not seem work.

Can anyone help me with this problem?

Kind regards.

Using groupby + join

s1=df.groupby(['A','B']).C.apply(','.join)

s1
Out[421]: 
A  B
A  A    Test1,Test2,XYZ
   B              BA,AB
B  A                 AA
Name: C, dtype: object

s1.reset_index().groupby('A').apply(lambda x : x.set_index(['A','B'])['C'].to_dict())
Out[420]: 
A
A    {('A', 'A'): 'Test1,Test2,XYZ', ('A', 'B'): 'B...
B                                   {('B', 'A'): 'AA'}
dtype: object

First sort_values by 3 columns, then groupby with join first, then join A with B columns and last groupby for dictionary per groups:

df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(','.join).reset_index()
#if only 3 columns DataFrame
#df1 = df.sort_values().groupby(['A','B'])['C'].apply(','.join).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
   A  B                C    D
0  A  A  Test1,Test2,XYZ  A,A
1  A  B            AB,BA  A,B
2  B  A               AA  B,A

s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
   A                                         val
0  A  {'A,A': 'Test1,Test2,XYZ', 'A,B': 'AB,BA'}
1  B                               {'B,A': 'AA'}

If need tuples only change first part of code:

df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(tuple).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
   A  B                    C    D
0  A  A  (Test1, Test2, XYZ)  A,A
1  A  B             (AB, BA)  A,B
2  B  A                (AA,)  B,A

s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
   A                                                val
0  A  {'A,A': ('Test1', 'Test2', 'XYZ'), 'A,B': ('AB...
1  B                                   {'B,A': ('AA',)}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM