How to calculate the difference between two cumsum columns using Pandas
I have the following dataframe:
duid start_date end_date
0 b2919f1eb 2019-08-26 2019-09-05
1 e372dedd4 2019-08-26 NaT
2 ba8147ce9 2019-09-09 2019-11-05
3 902c56036 2019-09-13 2019-10-01
4 16ec096a7 2019-09-17 2019-10-02
5 1faac1a15 2019-09-17 NaT
6 319fb59f5 2019-09-24 2020-01-20
7 2a3f1dac5 2019-10-01 NaT
8 aecbcf0c5 2019-10-01 2019-11-05
9 0ee088b63 2019-10-08 2019-10-03
10 c0c02fa4c 2019-10-31 2019-10-31
12 aac5fbc7d 2019-11-05 2019-11-05
11 c76bc248a 2019-11-05 2019-11-29
13 20dcef410 2019-11-12 NaT
14 bc7ea631d 2019-11-12 NaT
15 786af275b 2019-11-12 2019-11-12
16 005ec00c8 2019-11-15 NaT
17 482462695 2019-11-19 NaT
18 ecba54e5d 2019-11-26 NaT
19 28490c52f 2019-12-17 NaT
20 02f2f7f4b 2020-01-15 NaT
21 0ea659d1a 2020-01-29 NaT
22 0b78caca1 2020-01-29 NaT
23 368cc8744 2020-01-29 2020-01-29
The table describes the hire and departure dates of employees. So far I have managed to compute the monthly counts:
df.groupby(df['start_date'].dt.strftime('%Y %B')) \
.agg(hired=('start_date', 'size'), left=('end_date', 'count')) \
.reset_index()
start_date hired left
0 2019 August 2 1
1 2019 December 1 0
2 2019 November 8 3
3 2019 October 4 3
4 2019 September 5 4
5 2020 January 4 1
I also tried to compute the cumulative sum per date, but it returns strange results:
ds = df.groupby(df['start_date'].dt.strftime('%Y %B'))
ds.size().cumsum()
start_date
2019 August 2
2019 December 3
2019 November 11
2019 October 15
2019 September 20
2020 January 24
dtype: int64
and the cumulative count of those who left:
de = df.groupby(df['end_date'].dt.strftime('%Y %B'))
de.size().cumsum()
end_date
2019 November 5
2019 October 9
2019 September 10
2020 January 12
dtype: int64
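There is also a sorting problem: I don't know why the table is not sorted by start_date, but that issue is separate from calculating the difference between the two values, i.e.: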
df = df.sort_values('start_date')
How can I combine the cumulative sums over the two columns start_date and end_date to get the following result?
start_date hired left roster
0 2019 August 2 1 1
1 2019 September 5 4 2
2 2019 October 4 3 3
3 2019 November 8 3 8
4 2019 December 1 0 9
5 2020 January 4 1 12
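You may find it easier to keep the grouping key as a datetime-like object and only reformat it at the end, so that sorting works correctly (so pd.Grouper with freq, or .to_period(...), etc.). Keys produced by strftime('%Y %B') are plain strings, which sort alphabetically rather than chronologically.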
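A minimal sketch of why the key type matters, assuming a handful of illustrative dates (one per month from the question) and an English locale for month names. The variable names as_text and as_period are made up for this example:

```python
import pandas as pd

# One illustrative date per month that appears in the question's data.
dates = pd.Series(pd.to_datetime([
    "2019-08-26", "2019-09-09", "2019-10-01",
    "2019-11-05", "2019-12-17", "2020-01-15",
]))

# String keys compare character by character, so months come out alphabetically.
as_text = sorted(dates.dt.strftime("%Y %B").unique())

# Period keys compare as points in time, so months come out chronologically.
as_period = sorted(dates.dt.to_period("M").unique())

print(as_text)
print([str(p) for p in as_period])
```

With the string keys, "2019 December" lands right after "2019 August", which is the ordering seen in the question's output.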
First get your initial aggregate counts and sort by the grouping index, so that the data is guaranteed to be in chronological order:
agg = (
df.groupby(pd.Grouper(key='start_date', freq='M'))['end_date']
.agg(hired='size', left='count')
.sort_index()
)
Then assign a new column holding the running total for the roster:
agg['roster'] = agg['hired'].cumsum() - agg['left'].cumsum()
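As a quick check of the arithmetic, the sketch below plugs in the monthly counts from the question (already in chronological order) and shows that subtracting the two running totals gives the same result as taking the running total of the net monthly change. The names roster_a and roster_b are only for illustration:

```python
import pandas as pd

# Monthly counts from the question, August 2019 through January 2020.
hired = pd.Series([2, 5, 4, 8, 1, 4])
left = pd.Series([1, 4, 3, 3, 0, 1])

# Difference of the two running totals, as in the answer.
roster_a = hired.cumsum() - left.cumsum()

# Equivalent form: running total of the net change per month.
roster_b = (hired - left).cumsum()

print(roster_a.tolist())  # [1, 2, 3, 8, 9, 12]
```

Either form only gives a meaningful roster if the rows are in chronological order first, which is why the sort_index() step comes before it.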
Then reformat your index and reset it, e.g.:
agg = agg.set_index(agg.index.strftime('%Y %B')).reset_index()
This gives you:
start_date hired left roster
0 2019 August 2 1 1
1 2019 September 5 4 2
2 2019 October 4 3 3
3 2019 November 8 3 8
4 2019 December 1 0 9
5 2020 January 4 1 12
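For completeness, here is a self-contained sketch of the same approach using the .to_period('M') variant mentioned at the start of the answer instead of pd.Grouper. It rebuilds only the two date columns from the question's table (the duid column is left out), and the names rows and month are illustrative. It also sidesteps the month-end alias, which recent pandas releases appear to prefer spelled 'ME' rather than 'M' in pd.Grouper:

```python
import pandas as pd

# (start_date, end_date) pairs copied from the question; None stands for NaT.
rows = [
    ("2019-08-26", "2019-09-05"), ("2019-08-26", None),
    ("2019-09-09", "2019-11-05"), ("2019-09-13", "2019-10-01"),
    ("2019-09-17", "2019-10-02"), ("2019-09-17", None),
    ("2019-09-24", "2020-01-20"), ("2019-10-01", None),
    ("2019-10-01", "2019-11-05"), ("2019-10-08", "2019-10-03"),
    ("2019-10-31", "2019-10-31"), ("2019-11-05", "2019-11-05"),
    ("2019-11-05", "2019-11-29"), ("2019-11-12", None),
    ("2019-11-12", None), ("2019-11-12", "2019-11-12"),
    ("2019-11-15", None), ("2019-11-19", None),
    ("2019-11-26", None), ("2019-12-17", None),
    ("2020-01-15", None), ("2020-01-29", None),
    ("2020-01-29", None), ("2020-01-29", "2020-01-29"),
]
df = pd.DataFrame(rows, columns=["start_date", "end_date"])
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])  # None becomes NaT

# Group on a monthly Period key so the groups sort chronologically.
month = df["start_date"].dt.to_period("M")
agg = (
    df.groupby(month)
    .agg(hired=("start_date", "size"), left=("end_date", "count"))
    .sort_index()
)

# Running headcount: cumulative hires minus cumulative departures.
agg["roster"] = agg["hired"].cumsum() - agg["left"].cumsum()

# Only now turn the key into display text.
agg = agg.reset_index()
agg["start_date"] = agg["start_date"].dt.strftime("%Y %B")
print(agg)
```

As in the answer above, "left" counts the non-null end_date values among people hired in each month, not the departures that happened in that calendar month.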