How to perform a cumulative count of distinct values in a pandas DataFrame

Question

I have a DataFrame like this:

id    date         company    ......
123   2019-01-01        A
224   2019-01-01        B
345   2019-01-01        B
987   2019-01-03        C
334   2019-01-03        C
908   2019-01-04        C
765   2019-01-04        A
554   2019-01-05        A
482   2019-01-05        D

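I want the cumulative number of unique values in the `company` column over time. A company that shows up again on a later date should not be counted a second time.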

My expected output is:

date            cumulative_count
2019-01-01      2
2019-01-03      3
2019-01-04      3
2019-01-05      4

I tried:

df.groupby(['date']).company.nunique().cumsum()

However, this double counts a company that appears on more than one date.
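To make the over-count concrete, here is a minimal sketch that rebuilds the sample data and runs the attempt above. The DataFrame construction is my own assumption about the dtypes (dates kept as plain strings); only the three columns shown in the question are included.

```python
import pandas as pd

# Sample data from the question (dates kept as strings for simplicity).
df = pd.DataFrame({
    "id": [123, 224, 345, 987, 334, 908, 765, 554, 482],
    "date": ["2019-01-01"] * 3 + ["2019-01-03"] * 2
            + ["2019-01-04"] * 2 + ["2019-01-05"] * 2,
    "company": ["A", "B", "B", "C", "C", "C", "A", "A", "D"],
})

# Distinct companies within each date, counted independently per date.
per_day = df.groupby("date")["company"].nunique()

# Summing those per-date counts re-counts A and C on their later dates.
naive = per_day.cumsum()
print(naive)
```

Here `per_day` is 2, 1, 2, 2, so the running sum reaches 7 by the last date, while the desired result stops at 4.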

Answer 1

Use duplicated + cumsum + last:

m = df.duplicated('company')
d = df['date']

(~m).cumsum().groupby(d).last()

date
2019-01-01    2
2019-01-03    3
2019-01-04    3
2019-01-05    4
dtype: int32
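The one-liner can be unpacked step by step. This is only a sketch on the question's sample data, with intermediate names (`is_repeat`, `running`) that are mine. Note that it relies on the rows already being in date order, as they are in the sample; otherwise sorting with `df.sort_values('date')` first would be needed.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2019-01-01"] * 3 + ["2019-01-03"] * 2
            + ["2019-01-04"] * 2 + ["2019-01-05"] * 2,
    "company": ["A", "B", "B", "C", "C", "C", "A", "A", "D"],
})

# True for every row whose company has already appeared in an earlier row.
is_repeat = df.duplicated("company")

# Negating gives True only at first appearances; the cumulative sum is
# therefore the number of distinct companies seen up to each row.
running = (~is_repeat).cumsum()

# The last row of each date carries the count as of the end of that date.
result = running.groupby(df["date"]).last()
print(result)
```

The integer dtype of the result (`int32` or `int64`) depends on the platform and pandas version.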

Answer 2

Another approach, an attempt to fix anky_91's method:

(df.company.map(hash)).expanding().apply(lambda x: len(set(x)),raw=True).groupby(df.date).max()
Out[196]: 
date
2019-01-01    2.0
2019-01-03    3.0
2019-01-04    3.0
2019-01-05    4.0
Name: company, dtype: float64

From anky_91:

(df.company.astype('category').cat.codes).expanding().apply(lambda x: len(set(x)),raw=True).groupby(df.date).max()
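Both expanding variants rebuild a set for every row and return floats. A possible alternative, not taken from either answer, uses `pd.factorize`, which by default numbers values in order of first appearance. Under that assumption, and assuming no missing values and rows in date order, the running maximum of the codes plus one equals the number of distinct companies seen so far.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2019-01-01"] * 3 + ["2019-01-03"] * 2
            + ["2019-01-04"] * 2 + ["2019-01-05"] * 2,
    "company": ["A", "B", "B", "C", "C", "C", "A", "A", "D"],
})

# Codes are assigned in order of first appearance: A=0, B=1, C=2, D=3.
codes, uniques = pd.factorize(df["company"])

# The largest code seen so far, plus one, is the distinct count so far.
seen = pd.Series(codes, index=df.index).cummax() + 1

# Take the count reached by the end of each date.
result = seen.groupby(df["date"]).max()
print(result)
```

This keeps an integer dtype and avoids the per-row Python callback.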

Answer 3

This takes more code than anky's answer, but it still works on the sample data:

df = df.sort_values('date')
(df.drop_duplicates(['company'])
   .groupby('date')
   .size().cumsum()
   .reindex(df['date'].unique())
   .ffill()
)

Output:

date
2019-01-01    2.0
2019-01-03    3.0
2019-01-04    3.0
2019-01-05    4.0
dtype: float64
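The `reindex` and `ffill` steps are there because dates on which no new company appears drop out once duplicates are removed. The sketch below, using the question's sample data and intermediate names of my own (`first_seen`, `filled`), shows the gap and how it is filled. The final `astype(int)` is my addition to undo the float conversion caused by the temporary NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2019-01-01"] * 3 + ["2019-01-03"] * 2
            + ["2019-01-04"] * 2 + ["2019-01-05"] * 2,
    "company": ["A", "B", "B", "C", "C", "C", "A", "A", "D"],
})
df = df.sort_values("date")

# Keep only the first row of each company, then count new companies per date.
first_seen = df.drop_duplicates("company").groupby("date").size().cumsum()
print(first_seen)  # 2019-01-04 is absent: no company is new on that date

# Restore every date, carry the previous count forward, and cast back to int.
filled = first_seen.reindex(df["date"].unique()).ffill().astype(int)
print(filled)
```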

3 answers

Answer 1
8 accepted 2019-09-05 14:29:02

Answer 2
2 2019-09-05 14:58:01

Answer 3
1 2019-09-05 14:22:44
