Python Aggregation on time-series
I have a DataFrame df like this:
project_ID country prj_start prj_end revenue profit
2131 USA 201603 201703 100000 30000
5124 UK 201502 201606 1500 1000
1245 UK 201010 201710 1800 1000
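I want to find the number of active projects per month and per country, and sum up their revenue and profit. The output would look like this: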
Month country active_projects revenue profit
201603 USA 15 500000 100000
201603 UK 20 150000 100000
201604 Germany 30 1000000 500000
My first programming language was C++, so I tend to use loops for everything. I have almost solved the problem by creating month slots like this:
import datetime
import pandas as pd

#making a monthlist dataframe with count column to hold no. of active projects
monthlist = pd.DataFrame(columns=["months", "count"])
#making a new dataframe to insert the results into
newdf = pd.DataFrame(columns=["month", "country", "active_prj_count", "rev", "gp"])
#making the month slots, not concerned with future values
#(prj_start holds YYYYMM integers, so the earliest one is parsed explicitly;
# freq='M' means month-end and is spelled 'ME' in pandas 2.2+)
first_month = pd.to_datetime(str(df['prj_start'].min()), format='%Y%m')
monthlist['months'] = pd.date_range(start=first_month, end=datetime.date.today(), freq='M').map(lambda x: 100*x.year + x.month)
monthlist['count'] = 0
#traversing through the original dataframe and monthlist to insert a new row into newdf
#everytime the project start is less than and prj end is greater than the month slot
i = 0
for y in range(len(df)):
    for x in range(len(monthlist)):
        #'and' rather than '&': '&' binds tighter than the comparison operators
        if df.loc[y, 'prj_start'] <= monthlist.loc[x, 'months'] and df.loc[y, 'prj_end'] >= monthlist.loc[x, 'months']:
            monthlist.loc[x, 'count'] = monthlist.loc[x, 'count'] + 1
            newdf.loc[i] = [monthlist.loc[x, 'months'], df.loc[y, 'country'],
                            monthlist.loc[x, 'count'], df.loc[y, 'revenue'], df.loc[y, 'profit']]
            i = i + 1
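This solution works, but I must admit it is neither very smart nor computationally efficient, and it takes a while to run. Does anyone have ideas for improving the code with pandas or numpy functions?
Well, how about something like this (it depends on how you calculate the monthly profit; this is just an example):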
import pandas as pd

d = {'projectid': [2131, 5124, 1245], 'country': ['USA', 'UK', 'UK'], 'pr_start': ['2016-03', '2015-02', '2010-10'], 'pr_end': ['2017-03', '2016-06', '2017-10'], 'total_revenue': [100000, 1500, 1800], 'total_profit': [30000, 1000, 1000]}
df = pd.DataFrame(data=d)
df['pr_end'] = pd.to_datetime(df['pr_end'])
df['pr_start'] = pd.to_datetime(df['pr_start'])
# project length in whole months, computed as a plain integer
# (subtracting Periods returns offset objects in newer pandas versions)
df['project_length'] = ((df['pr_end'].dt.year - df['pr_start'].dt.year) * 12
                        + (df['pr_end'].dt.month - df['pr_start'].dt.month))
df['monthly_revenue'] = df['total_revenue'] / df['project_length']
df['monthly_profit'] = df['total_profit'] / df['project_length']
# split every project into one row per month; DataFrame.append was removed
# in pandas 2.0, so the rows are collected first and combined once
pieces = []
for idx, row in df.iterrows():
    for i in range(max(row.project_length, 1)):
        piece = row.copy()
        piece['pr_start'] = row.pr_start + pd.DateOffset(months=i)
        piece['pr_end'] = row.pr_start + pd.DateOffset(months=i + 1)
        pieces.append(piece)
df = pd.DataFrame(pieces).infer_objects()
df = df.sort_values(by='pr_start').sort_index(kind='mergesort')
print(df.groupby(['pr_start', 'country']).agg({'projectid': 'count', 'monthly_revenue': 'sum', 'monthly_profit': 'sum'}).rename(columns={'projectid': 'Active Projects'}))
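The same even monthly split can also be sketched without an explicit row-building loop, by giving each project a list of its active months and calling DataFrame.explode. This is only a sketch under the same assumption as above (revenue and profit spread evenly, end month exclusive); the names month, monthly and summary are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'projectid': [2131, 5124, 1245],
    'country': ['USA', 'UK', 'UK'],
    'pr_start': ['2016-03', '2015-02', '2010-10'],
    'pr_end': ['2017-03', '2016-06', '2017-10'],
    'total_revenue': [100000, 1500, 1800],
    'total_profit': [30000, 1000, 1000],
})

start = pd.to_datetime(df['pr_start'])
end = pd.to_datetime(df['pr_end'])
# Whole months between start and end (end month treated as exclusive).
df['project_length'] = (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)
df['monthly_revenue'] = df['total_revenue'] / df['project_length']
df['monthly_profit'] = df['total_profit'] / df['project_length']

# One list of month-start timestamps per project (assumed column name: month).
df['month'] = pd.Series(
    [list(pd.date_range(s, periods=n, freq='MS')) for s, n in zip(start, df['project_length'])],
    index=df.index,
)
# explode turns each list element into its own row.
monthly = df.explode('month')
monthly['month'] = pd.to_datetime(monthly['month'])

summary = (
    monthly.groupby(['month', 'country'])
    .agg(active_projects=('projectid', 'count'),
         monthly_revenue=('monthly_revenue', 'sum'),
         monthly_profit=('monthly_profit', 'sum'))
    .reset_index()
)
print(summary.head())
```

Because the expansion happens in one explode call, the cost grows with the number of project-months rather than with repeated frame copies.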
You can apply a function to each row to extract the dates on which each project is active, and then aggregate by month and country.
>>> df
project_ID country prj_start prj_end revenue profit
0 2131 USA 201603 201703 100000 30000
1 5124 UK 201502 201606 1500 1000
2 1245 UK 201010 201710 1800 1000
Let's add a few more samples so that another country is included:
>>> df_new = pd.DataFrame([
...     [1111, 'Germany', 201603, 201703, 1000, 4000],
...     [4111, 'Germany', 201603, 201703, 4000, 6000],
...     [3112, 'Germany', 201010, 201703, 4000, 6000],
...     [2112, 'Germany', 201603, 201703, 4000, 6000],
...     [2116, 'Germany', 201502, 201710, 4000, 6000]],
...     columns=df.columns)
>>> df_new
project_ID country prj_start prj_end revenue profit
0 1111 Germany 201603 201703 1000 4000
1 4111 Germany 201603 201703 4000 6000
2 3112 Germany 201010 201703 4000 6000
3 2112 Germany 201603 201703 4000 6000
4 2116 Germany 201502 201710 4000 6000
>>> df_ = pd.concat([df,df_new],axis=0,ignore_index=True)
>>> df_
project_ID country prj_start prj_end revenue profit
0 2131 USA 201603 201703 100000 30000
1 5124 UK 201502 201606 1500 1000
2 1245 UK 201010 201710 1800 1000
3 1111 Germany 201603 201703 1000 4000
4 4111 Germany 201603 201703 4000 6000
5 3112 Germany 201010 201703 4000 6000
6 2112 Germany 201603 201703 4000 6000
7 2116 Germany 201502 201710 4000 6000
Convert prj_start and prj_end to datetime, specifying the layout to parse with format="%Y%m":
>>> df_[['prj_start','prj_end']] = df_[['prj_start','prj_end']].apply(pd.to_datetime, format="%Y%m")
>>> df_
project_ID country prj_start prj_end revenue profit
0 2131 USA 2016-03-01 2017-03-01 100000 30000
1 5124 UK 2015-02-01 2016-06-01 1500 1000
2 1245 UK 2010-10-01 2017-10-01 1800 1000
3 1111 Germany 2016-03-01 2017-03-01 1000 4000
4 4111 Germany 2016-03-01 2017-03-01 4000 6000
5 3112 Germany 2010-10-01 2017-03-01 4000 6000
6 2112 Germany 2016-03-01 2017-03-01 4000 6000
7 2116 Germany 2015-02-01 2017-10-01 4000 6000
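As a small aside on this conversion step, the sketch below shows what format="%Y%m" does to YYYYMM values and how to get back to the integer layout used in the question. The series names months, parsed and as_int are hypothetical and only serve the illustration; casting to str first is an extra precaution, not something the original answer does.

```python
import pandas as pd

# Hypothetical sample values in the question's YYYYMM integer layout.
months = pd.Series([201603, 201502, 201010])

# Casting to str makes the parse explicit; "%Y%m" maps each value
# to the first day of that month.
parsed = pd.to_datetime(months.astype(str), format="%Y%m")

# Back to YYYYMM integers, e.g. for a final report column.
as_int = parsed.dt.year * 100 + parsed.dt.month

print(parsed.tolist())
print(as_int.tolist())
```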
Now let's define a function that transforms a row, and apply it:
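import numpy as np
import pandas as pd

def transform_row(row):
    # row is a one-row DataFrame; build one month-start date per active month
    date_index = pd.date_range(row['prj_start'].min(),
                               row['prj_end'].max(), freq='MS')
    # repeat the row once per month and index the copies by date
    row_out = pd.DataFrame(np.repeat(row.values,
                                     len(date_index.values), axis=0),
                           index=date_index, columns=row.columns)
    row_out.index.name = 'date'
    return row_out.reset_index()

# expand every project, then restore the numeric dtypes lost in the object-typed copies
df_transformed = pd.concat([transform_row(row.to_frame().T)
                            for i, row in df_.iterrows()], axis=0).infer_objects()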
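The row-by-row expansion can also be sketched with DataFrame.explode, which avoids building one small frame per project. This is a hedged alternative, not the method of the original answer: the names dates and df_long are assumptions made for the example, and the input is a three-row stand-in for df_ after the datetime conversion.

```python
import pandas as pd

# Small stand-in for df_ after the datetime conversion.
df_ = pd.DataFrame({
    'project_ID': [2131, 5124, 1245],
    'country': ['USA', 'UK', 'UK'],
    'prj_start': pd.to_datetime(['2016-03-01', '2015-02-01', '2010-10-01']),
    'prj_end': pd.to_datetime(['2017-03-01', '2016-06-01', '2017-10-01']),
    'revenue': [100000, 1500, 1800],
    'profit': [30000, 1000, 1000],
})

# One list of month starts per project, end month inclusive like transform_row.
dates = pd.Series(
    [list(pd.date_range(s, e, freq='MS')) for s, e in zip(df_['prj_start'], df_['prj_end'])],
    index=df_.index,
)

# explode yields one row per (project, month); other columns keep their dtypes.
df_long = df_.assign(date=dates).explode('date').reset_index(drop=True)
df_long['date'] = pd.to_datetime(df_long['date'])

print(df_long.shape)
```

Unlike np.repeat on row.values, this keeps revenue and profit as integer columns, so no dtype repair is needed before aggregating.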
Then, finally, apply pivot_table to aggregate the values by country and date:
# total revenue and profit per (date, country)
df1 = pd.pivot_table(df_transformed,
                     index=['date', 'country'],
                     values=['revenue', 'profit'],
                     aggfunc='sum', fill_value=0)
# number of active projects per (date, country)
df2 = pd.pivot_table(df_transformed,
                     index=['date', 'country'],
                     values=['project_ID'],
                     aggfunc=len, fill_value=0)
Finally, concatenate the two DataFrames to get the data per month:
pd.concat([df1,df2],axis=1)
profit revenue project_ID
date country
2010-10-01 Germany 6000 4000 1
UK 1000 1800 1
2010-11-01 Germany 6000 4000 1
UK 1000 1800 1
2010-12-01 Germany 6000 4000 1
UK 1000 1800 1
2011-01-01 Germany 6000 4000 1
UK 1000 1800 1
2011-02-01 Germany 6000 4000 1
UK 1000 1800 1
2011-03-01 Germany 6000 4000 1
UK 1000 1800 1
2011-04-01 Germany 6000 4000 1
UK 1000 1800 1
2011-05-01 Germany 6000 4000 1
UK 1000 1800 1
2011-06-01 Germany 6000 4000 1
UK 1000 1800 1
2011-07-01 Germany 6000 4000 1
UK 1000 1800 1
2011-08-01 Germany 6000 4000 1
UK 1000 1800 1
2011-09-01 Germany 6000 4000 1
UK 1000 1800 1
2011-10-01 Germany 6000 4000 1
UK 1000 1800 1
2011-11-01 Germany 6000 4000 1
UK 1000 1800 1
2011-12-01 Germany 6000 4000 1
UK 1000 1800 1
... ... ... ...
2016-10-01 USA 30000 100000 1
2016-11-01 Germany 28000 17000 5
UK 1000 1800 1
USA 30000 100000 1
2016-12-01 Germany 28000 17000 5
UK 1000 1800 1
USA 30000 100000 1
2017-01-01 Germany 28000 17000 5
UK 1000 1800 1
USA 30000 100000 1
2017-02-01 Germany 28000 17000 5
UK 1000 1800 1
USA 30000 100000 1
2017-03-01 Germany 28000 17000 5
UK 1000 1800 1
USA 30000 100000 1
2017-04-01 Germany 6000 4000 1
UK 1000 1800 1
2017-05-01 Germany 6000 4000 1
UK 1000 1800 1
2017-06-01 Germany 6000 4000 1
UK 1000 1800 1
2017-07-01 Germany 6000 4000 1
UK 1000 1800 1
2017-08-01 Germany 6000 4000 1
UK 1000 1800 1
2017-09-01 Germany 6000 4000 1
UK 1000 1800 1
2017-10-01 Germany 6000 4000 1
UK 1000 1800 1
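The two pivot tables plus the concat can probably be folded into a single groupby with named aggregation, which also lets the count column carry a clearer name. The sketch below uses a tiny hand-made long table (a subset of the sample values, for illustration only); df_long, summary and active_projects are names assumed for this example.

```python
import pandas as pd

# Tiny long table: one row per (project, active month). Illustrative subset only.
df_long = pd.DataFrame({
    'date': pd.to_datetime(['2017-03-01', '2017-03-01', '2017-03-01', '2017-04-01']),
    'country': ['Germany', 'Germany', 'UK', 'UK'],
    'project_ID': [1111, 2116, 1245, 1245],
    'revenue': [1000, 4000, 1800, 1800],
    'profit': [4000, 6000, 1000, 1000],
})

# Named aggregation: output column = (input column, aggregation).
summary = df_long.groupby(['date', 'country']).agg(
    profit=('profit', 'sum'),
    revenue=('revenue', 'sum'),
    active_projects=('project_ID', 'count'),
)

print(summary)
```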