Concat dataframes/series with axis=1 in a loop

I have a dataframe of email senders as follows. I am trying to produce, as output, a dataframe with the number of emails sent by each person per month. I want the index to be the month end and the columns to be the senders. I am able to build this, but with two issues:

First, I am using multiple pd.concat statements (all the df_temps), which is ugly and does not scale. Is there a way to put this in a for loop, or some other way to loop over, say, the first n persons?

Second, while it puts all the data together correctly, there is a discontinuity in the index: the second-to-last row is 1999-01-31 and the last one is 2000-01-31. Is there an option or another way to get NaN for the in-between months?

Code below:

import pandas as pd

df_in = pd.DataFrame({
'sender':['Able Boy','Able Boy','Able Boy','Mark L. Taylor','Mark L. Taylor',
    'Mark L. Taylor','scott kirk','scott kirk','scott kirk','scott kirk',
    'Able Boy','Able Boy','james h. madison','james h. madison','james h. madison',
    'james joyce','scott kirk','james joyce','james joyce','james joyce',
    'james h. madison','Able Boy'],
'receiver':['Toni Z. Zapata','Mark Angel','scott kirk','paul a boyd','michelle fam',
    'debbie bradford','Mark Angel','Johnny C. Cash','Able Boy','Mark L. Taylor',
    'jenny chang','julie s. smith', 'scott kirk', 'tiffany r.','Able Boy',
    'Mark Angel','Able Boy','julie s. smith','jenny chang','debbie bradford',
    'Able Boy','Toni Z. Zapata'],
'time':[911929000000,911929000000,910228000000,911497000000,911497000000,
    911932000000,914261000000,914267000000,914269000000,914276000000,
    914932000000,915901000000,916001000000,916001000000,916001000000,
    947943000000,947943000000,947943000000,947943000000,947943000000,
    916001000000,911929100000],
'email_ID':['<A34E5R>','<A34E5R>','<B34E5R>','<C34E5R>','<C34E5R>',
    '<C36E5R>','<C36E5A>','<C36E5B>','<C36E5C>','<C36E5D>',
    '<D000A0>','<D000A1>','<D000A2>','<D000A2>','<D000A2>',
    '<D000A3>','<D000A3>','<D000A3>','<D000A3>','<D000A3>',
    '<D000A4>','<A34E5S>']
})
df_in['time'] = pd.to_datetime(df_in['time'],unit='ms')

# helper column: summing it within each group gives a count
df_1 = df_in.copy()
df_1['number'] = 1

# an email sent to several receivers appears once per receiver,
# so keep one row per email_ID
df_2 = df_1.drop_duplicates(subset="email_ID",keep="first",inplace=False)\
        .reset_index()

df_3 = df_2.drop(columns=['index','receiver','email_ID'],inplace=False)

# emails per sender per month-end
df_6 = df_3.groupby(['sender',pd.Grouper(key='time',freq='M')]).sum()

df_6_squeezed = df_6.squeeze()

df_grp_1 = df_3.groupby(['sender']).count()
df_grp_1.sort_values(by=['number'],ascending=False,inplace=True)

# sender names, ordered from most to fewest emails sent
toppers = list(df_grp_1.index.array)

df_temp_1 = df_6_squeezed[toppers[0]]
df_temp_2 = df_6_squeezed[toppers[1]]
df_temp_3 = df_6_squeezed[toppers[2]]
df_temp_4 = df_6_squeezed[toppers[3]]
df_temp_5 = df_6_squeezed[toppers[4]]

df_temp_1.rename(toppers[0],inplace=True)
df_temp_2.rename(toppers[1],inplace=True)
df_temp_3.rename(toppers[2],inplace=True)
df_temp_4.rename(toppers[3],inplace=True)
df_temp_5.rename(toppers[4],inplace=True)

df_concat_1 = pd.concat([df_temp_1,df_temp_2],axis=1,sort=False)
df_concat_2 = pd.concat([df_concat_1,df_temp_3],axis=1,sort=False)
df_concat_3 = pd.concat([df_concat_2,df_temp_4],axis=1,sort=False)
df_concat_4 = pd.concat([df_concat_3,df_temp_5],axis=1,sort=False)
print("\nCONCAT  (df_concat_4):")
print(df_concat_4)
print(type(df_concat_4))

Consider pivot_table after calculating month_end (see @Root's answer). Also, use reindex to fill in the missing months. In pandas, grouping aggregations such as a count of emails per sender per month usually do not require looping or temporary helper dataframes. Note that this counts every row; if an email sent to several receivers should only count once, drop duplicate email_ID rows first, as df_2 in the question does.

from pandas.tseries.offsets import MonthEnd

df_in['month_end'] = (df_in['time'] + MonthEnd(0)).dt.normalize()  # roll forward to month end, strip the time of day

agg_df = (df_in.pivot_table(index='month_end', columns='sender', values='time', aggfunc='count')
               .reindex(pd.date_range('1998-01-01', '2000-01-31', freq='M'), axis='index')
               .fillna(0)   # omit this line to keep NaN for the empty months
          )
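
If you only want the first n senders (the toppers idea from the question), the pivot result can be cut down by its column totals. A minimal sketch; n and the variable names here are arbitrary:

n = 3  # keep the n most prolific senders (arbitrary cutoff)
top_senders = agg_df.sum().nlargest(n).index
agg_df_top = agg_df[top_senders]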

Output

print(agg_df)  
# sender      Able Boy  Mark L. Taylor  james h. madison  james joyce  scott kirk
# month_end                                                                      
# 1998-01-31       0.0             0.0               0.0          0.0         0.0
# 1998-02-28       0.0             0.0               0.0          0.0         0.0
# 1998-03-31       0.0             0.0               0.0          0.0         0.0
# 1998-04-30       0.0             0.0               0.0          0.0         0.0
# 1998-05-31       0.0             0.0               0.0          0.0         0.0
# 1998-06-30       0.0             0.0               0.0          0.0         0.0
# 1998-07-31       0.0             0.0               0.0          0.0         0.0
# 1998-08-31       0.0             0.0               0.0          0.0         0.0
# 1998-09-30       0.0             0.0               0.0          0.0         0.0
# 1998-10-31       0.0             0.0               0.0          0.0         0.0
# 1998-11-30       4.0             3.0               0.0          0.0         0.0
# 1998-12-31       1.0             0.0               0.0          0.0         4.0
# 1999-01-31       1.0             0.0               4.0          0.0         0.0
# 1999-02-28       0.0             0.0               0.0          0.0         0.0
# 1999-03-31       0.0             0.0               0.0          0.0         0.0
# 1999-04-30       0.0             0.0               0.0          0.0         0.0
# 1999-05-31       0.0             0.0               0.0          0.0         0.0
# 1999-06-30       0.0             0.0               0.0          0.0         0.0
# 1999-07-31       0.0             0.0               0.0          0.0         0.0
# 1999-08-31       0.0             0.0               0.0          0.0         0.0
# 1999-09-30       0.0             0.0               0.0          0.0         0.0
# 1999-10-31       0.0             0.0               0.0          0.0         0.0
# 1999-11-30       0.0             0.0               0.0          0.0         0.0
# 1999-12-31       0.0             0.0               0.0          0.0         0.0
# 2000-01-31       0.0             0.0               0.0          4.0         1.0
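
If you would rather keep the groupby approach from the question, the five df_temp/df_concat steps also collapse into a single pd.concat over a list comprehension. A minimal sketch, reusing df_6_squeezed and toppers as defined in the question:

df_top = pd.concat(
    [df_6_squeezed[name].rename(name) for name in toppers[:5]],
    axis=1,
    sort=False,
)

The same reindex as above can then be applied to df_top to insert the missing months as NaN.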
