简体   繁体   中英

pandas - group and count nunique values

I have this kind of dataframe df:

User,C,G
111,ar,1
112,es,1
112,es,1
112,es,2
113,es,2
113,es,3
113,es,3
114,es,4

What I would like to return as output is:

G,nU,ar,es
1,2,1,1
2,2,0,2
3,1,0,1
4,1,0,1

Basically, for each G , I'm counting the number of different User inside it on the nU column and the occurrences of the strings in C . Each User has a unique C value. For instance, in the G number 1 I have two Users (111 and 112), with one occurrence in 'ar' and one in 'es' (no matter if there are two 112 occurrences, I just need the (112,'es') single couple). Summing up the 'ar' and 'es' columns should return the nU column. So far I tried this:

d = df.reset_index().groupby('G')['User'].nunique()

which correctly returns the count of Users but no information about the C column.

Sorry for the confusion this might cause.

Given df ,

result = df.groupby(['G', 'User'])['C'].value_counts()

yields

G  User    
1  111   ar    1
   112   es    2
2  112   es    1
   113   es    1
3  113   es    2
4  114   es    1
dtype: int64

This counts each occurrence of ar and es . We really only want to count unique occurrences, so let's set each value in the Series to 1:

result[:] = 1

so that result looks like

G  User    
1  111   ar    1
   112   es    1
2  112   es    1
   113   es    1
3  113   es    1
4  114   es    1
dtype: int64

Now if we group by the first and last index levels (the G values and the C values), and sum each group,

result = result.groupby(level=['G',-1]).sum()

we get

G    
1  ar    1
   es    1
2  es    2
3  es    1
4  es    1
dtype: int64

Now we can unstack the last index level:

result = result.unstack()

to obtain

   ar  es
G        
1   1   1
2 NaN   2
3 NaN   1
4 NaN   1

Fill the NaNs with zeros:

result = result.fillna(0)

Define the nU column and the sum of the rows:

result['nU'] = result.sum(axis=1)

and reorder the columns:

result = result[['nU', 'ar', 'es']]

Putting it all together:

import pandas as pd
df = pd.read_csv('data')
result = df.groupby(['G', 'User'])['C'].value_counts()
result[:] = 1
result = result.groupby(level=['G',-1]).sum()
result = result.unstack()
result = result.fillna(0)
result['nU'] = result.sum(axis=1)
result = result[['nU', 'ar', 'es']]

yields

   nU  ar  es
G            
1   2   1   1
2   2   0   2
3   1   0   1
4   1   0   1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM