
Groupby and take sum by levels for multi-categorical variable

I have this data:

ID  Page Time_on_page
1    A       60
1    B       80
2    C       120
2    C       30
3    A       10
3    B       50
3    C       60
3    B       30

And I have to group it by ID and take the sum of Time_on_page for each level of Page, together with the related dummy variables (this is a simplified version; I have way more than 3 unique pages):

ID  Page_A  Page_B  Page_C  Time_on_page_A  Time_on_page_B  Time_on_page_C
1     1       1        0         60               80              0
2     0       0        1         0                 0              150
3     1       1        1         10                80              60

I tried with

pd.get_dummies(df, columns=cols, drop_first=False).groupby(['ID','Page'], as_index=False).sum()

But it doesn't work: after get_dummies the Page column no longer exists, so the groupby raises a KeyError.

Thanks for your help!

Here's a way using pd.pivot_table:

out = (df.pivot_table(index='ID', columns='Page',
                      values='Time_on_page', aggfunc='sum')
         .add_prefix('Time_on_page_'))
# Dummies: 1 wherever the ID has any time recorded for that page
dummies = out.notna().astype('i1')
dummies.columns = dummies.columns.str.replace('Time_on_page_', 'Page_')
out.assign(**dummies).fillna(0).astype(int)

Page  Time_on_page_A  Time_on_page_B  Time_on_page_C  Page_A  Page_B  Page_C
ID
1                 60              80               0       1       1       0
2                  0               0             150       0       0       1
3                 10              80              60       1       1       1

Maybe something like this using crosstab:

(pd.crosstab(df.ID, df.Page, df.Page, aggfunc='nunique').fillna(0)
   .add_prefix('Page_')
   .join(pd.crosstab(df.ID, df.Page, df.Time_on_page, aggfunc='sum')
           .fillna(0).add_prefix('Time_on_Page_')))

Page  Page_A  Page_B  Page_C  Time_on_Page_A  Time_on_Page_B  Time_on_Page_C
ID                                                                          
1        1.0     1.0     0.0            60.0            80.0             0.0
2        0.0     0.0     1.0             0.0             0.0           150.0
3        1.0     1.0     1.0            10.0            80.0            60.0
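The crosstab result comes out as 1.0/0.0 floats because of the NaN fill. If integer output like the question's expected table is wanted, a final cast does it — a sketch of the same two crosstabs, with the prefixes spelled to match the question's expected column names:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3, 3, 3],
    'Page': ['A', 'B', 'C', 'C', 'A', 'B', 'C', 'B'],
    'Time_on_page': [60, 80, 120, 30, 10, 50, 60, 30],
})

# Same two crosstabs as above; astype(int) turns the float artifacts of
# the NaN fill back into the integers the expected output shows
out = (pd.crosstab(df.ID, df.Page, df.Page, aggfunc='nunique').fillna(0)
         .add_prefix('Page_')
         .join(pd.crosstab(df.ID, df.Page, df.Time_on_page, aggfunc='sum')
                 .fillna(0).add_prefix('Time_on_page_'))
         .astype(int))
```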

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3, 3, 3],
    'Page': ['A', 'B', 'C', 'C', 'A', 'B', 'C', 'B'],
    'Time_on_page': [60, 80, 120, 30, 10, 50, 60, 30]
})

# Create dummies (dtype=int keeps them as 0/1 instead of booleans on recent pandas)
adf = (pd.get_dummies(df, columns=['Page'], dtype=int)
         .groupby('ID').max().reset_index())

# Calculate per-ID, per-Page time sums
bdf = (df.groupby(['ID', 'Page'])['Time_on_page'].sum()
         .unstack('Page').fillna(0).reset_index())

# Merge both
result = adf.merge(bdf, on=['ID']).drop('Time_on_page', axis=1)

print(result)

   ID  Page_A  Page_B  Page_C     A     B      C
0   1       1       1       0  60.0  80.0    0.0
1   2       0       0       1   0.0   0.0  150.0
2   3       1       1       1  10.0  80.0   60.0
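The merged result above leaves the time sums under bare A/B/C headers. If they should carry the Time_on_page_A-style names from the question's expected output, an add_prefix on bdf before the merge does it — a sketch of the same approach (dtype=int is an addition that keeps the dummies numeric on recent pandas):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3, 3, 3],
    'Page': ['A', 'B', 'C', 'C', 'A', 'B', 'C', 'B'],
    'Time_on_page': [60, 80, 120, 30, 10, 50, 60, 30],
})

adf = (pd.get_dummies(df, columns=['Page'], dtype=int)
         .groupby('ID').max().reset_index())
# add_prefix renames the bare A/B/C sum columns to Time_on_page_A etc.
bdf = (df.groupby(['ID', 'Page'])['Time_on_page'].sum()
         .unstack('Page').fillna(0)
         .add_prefix('Time_on_page_')
         .reset_index())
result = adf.merge(bdf, on='ID').drop('Time_on_page', axis=1)
```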
df1 = df.groupby(['ID', 'Page'], as_index=False)['Time_on_page'].sum()
pd.pivot_table(df1, values='Time_on_page', index='ID', columns='Page',
               aggfunc=[len, sum], fill_value=0)

Result:

     len       sum         
Page   A  B  C   A   B    C
ID                         
1      1  1  0  60  80    0
2      0  0  1   0   0  150
3      1  1  1  10  80   60
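The len/sum table above keeps two-level MultiIndex columns. To flatten them into the question's expected single-level names, the (aggfunc, Page) pairs can be mapped — a sketch of the same pivot, using the 'count' string in place of the len builtin:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3, 3, 3],
    'Page': ['A', 'B', 'C', 'C', 'A', 'B', 'C', 'B'],
    'Time_on_page': [60, 80, 120, 30, 10, 50, 60, 30],
})

# Pre-aggregate so each (ID, Page) pair appears at most once,
# making the count a 0/1 page indicator
df1 = df.groupby(['ID', 'Page'], as_index=False)['Time_on_page'].sum()
out = pd.pivot_table(df1, values='Time_on_page', index='ID',
                     columns='Page', aggfunc=['count', 'sum'], fill_value=0)
# Flatten ('count', 'A') -> 'Page_A' and ('sum', 'A') -> 'Time_on_page_A'
out.columns = [f'Page_{p}' if agg == 'count' else f'Time_on_page_{p}'
               for agg, p in out.columns]
```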

Groupby ID and Page, aggregate each column, and unstack. Finally, flatten the MultiIndex columns with map and join:

# Note: recent pandas versions raise a KeyError when agg() references a
# grouping column, so the 1-indicator is built from Time_on_page instead
df1 = (df.groupby(['ID', 'Page'])
         .agg(Page=('Time_on_page', lambda x: 1),
              Time_on_page=('Time_on_page', 'sum'))
         .unstack(fill_value=0))
df1.columns = df1.columns.map('_'.join)


Out[467]:
    Page_A  Page_B  Page_C  Time_on_page_A  Time_on_page_B  Time_on_page_C
ID
1        1       1       0              60              80               0
2        0       0       1               0               0             150
3        1       1       1              10              80              60
