简体   繁体   中英

Combine rows in a DataFrame & add the value as a column

My Dataframe looks like this:

campaign_name  campaign_id    event_name  clicks  installs  conversions
   campaign_1         1234  registration     100         5            1
   campaign_1         1234    hv_users_r     100         5            2
   campaign_2         2345  registration     500        10            3
   campaign_2         2345    hv_users_w     500        10            2
   campaign_3         3456  registration    1000        50           10
   campaign_4         3456    hv_users_r    1000        50           15
   campaign_4         3456    hv_users_w    1000        50           25

I want to categorize all the "event names" into 2 new columns, where 1st new columns represents "registration", and the 2nd new column represents "hv_users", which will the sum of all rows having event-names of "hv_users_r" & "hv_users_w".

To keep this simple - "registration" column will have rows which only have event_name as "registration". All non "registration" event_names would go into the new column "hv_users".

This is my expected new Dataframe:

campaign_name  campaign_id  clicks installs  registrations  hv_users
   campaign_1         1234     100        5              1         2
   campaign_2         2345     500       10              3         2
   campaign_3         3456    1000       50             10        40  

Can someone please give me directions on how to go from the input DataFrame to the output DataFrame?

You can using split + join , then groupby + unstack

df.assign(event_name=df['event_name'].apply(lambda x:"_".join(x.split("_", 2)[:2]))).\
    groupby(['ampaign_name','campaign_id','clicks','installs','event_name'])['conversions'].sum().\
      unstack(fill_value=0).reset_index()
Out[302]: 
event_name ampaign_name  campaign_id  clicks  installs  hv_users  registration
0            campaign_1         1234     100         5         2             1
1            campaign_2         2345     500        10         2             3
2            campaign_3         3456    1000        50         0            10
3            campaign_4         3456    1000        50        40             0
df['hv_users'] = df.conversions.where(df.event_name.str.match(r'hv_users_[r|w]'), 0)
df['registrations'] = df.conversions.where(df.event_name == 'registration', 0)
df.hv_users = df.groupby('campaign_id').hv_users.transform(sum)
df = df.groupby('campaign_id').head(1).drop('event_name', axis=1)

pd.crosstab() and pd.pivot() should do the trick.

#df is your input dataframe
replacement = {'hv_users_w':'hv_users', 'hv_users_r':'hv_users','registration':'registration'}
df.event_name = df.event_name.map(replacement)
df1 = pd.crosstab(df.campaign_name, df.event_name)    
df2 = pd.pivot_table(df, index = 'campaign_name')
output = pd.concat([df1,df2], axis = 1)

Try using pivot_table

df.loc[df['event_name'].str.contains('_'), 'event_name'] = df.loc[df['event_name'].str.contains('_'), 'event_name'].str.extract('(.*_.*)_.*', expand = False)
new_df = df.pivot_table(index=['campaign_name', 'campaign_id','clicks', 'installs'], columns='event_name', values = 'conversions',aggfunc='sum',fill_value=0).reset_index().rename_axis(None, axis=1)

    campaign_name   campaign_id clicks  installs    hv_users    registration
0   campaign_1      1234        100     5           2           1
1   campaign_2      2345        500     10          2           3
2   campaign_3      3456        1000    50          0           10
3   campaign_4      3456        1000    50          40          0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM