Group-level cumulative count in Python

Question

I have a pandas DataFrame like this:

import pandas as pd

df = pd.DataFrame([
        ['A', 1234, 20120201],
        ['A', 1134, 20120201],
        ['A', 1011, 20120201],
        ['A', 1123, 20121004],
        ['A', 1111, 20121004],
        ['A', 1224, 20121105],
        ['B', 1156, 20120403],
        ['B', 2345, 20120504],
        ['B', 4567, 20120504],
        ['B', 8796, 20120606]
    ], columns = ['company', 'invoice', 'date'])

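The goal is to create a new column named "TotalPaidInvoices" that counts, for each record, the number of invoices the same company paid on earlier dates.

I tried the following: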

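# Dates are stored as integers like 20120201, so parse them with an explicit format.
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')
df = df.sort_values(['company', 'date'], ascending=[True, True]).reset_index(drop=True)
# Number the first row of each new date within a company, then forward-fill the rest.
df['totalpaidinvoices'] = df[df['date'] != df['date'].shift(1)].groupby('company').cumcount()
df['totalpaidinvoices'] = df.groupby('company')['totalpaidinvoices'].ffill()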

But instead of the number of invoices, I get the number of company-date combinations that precede the current record.

Output:

df = pd.DataFrame(
    [
        ['A', 1234, 20120201, 0.0],
        ['A', 1134, 20120201, 0.0],
        ['A', 1011, 20120201, 0.0],
        ['A', 1123, 20121004, 1.0],
        ['A', 1111, 20121004, 1.0],
        ['A', 1224, 20121105, 2.0],
        ['B', 1156, 20120403, 0.0],
        ['B', 2345, 20120504, 1.0],
        ['B', 4567, 20120504, 1.0],
        ['B', 8796, 20120606, 2.0]
    ], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])

Expected output:

df = pd.DataFrame(
    [
        ['A', 1234, 20120201, 0.0],
        ['A', 1134, 20120201, 0.0],
        ['A', 1011, 20120201, 0.0],
        ['A', 1123, 20121004, 3.0],
        ['A', 1111, 20121004, 3.0],
        ['A', 1224, 20121105, 5.0],
        ['B', 1156, 20120403, 0.0],
        ['B', 2345, 20120504, 1.0],
        ['B', 4567, 20120504, 1.0],
        ['B', 8796, 20120606, 3.0]
    ], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])

Any suggestions on how to solve this?

Answer 1

First, let's count how many invoices each company paid on each date:

tmp1 = df.groupby(['company', 'date']).size().rename('totalpaidinvoices')

Then, for each company, we need to count how many invoices were paid before the current date. This is a job for cumsum:

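# group_keys=False keeps the original (company, date) index on newer pandas versions
tmp2 = tmp1.groupby('company', group_keys=False).apply(lambda s: s.cumsum() - s)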
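To see why subtracting the series from its own cumulative sum gives "paid before" rather than "paid so far", here is a minimal sketch on a single company. The `counts` series and the name `paid_before` are illustrative assumptions; the values mirror company A's per-date counts from the question.

```python
import pandas as pd

# Per-date invoice counts for one company, already in date order.
counts = pd.Series([3, 2, 1], index=[20120201, 20121004, 20121105])

# cumsum() includes the current date; subtracting the current count
# leaves only the invoices paid on strictly earlier dates.
paid_before = counts.cumsum() - counts
print(paid_before.tolist())  # [0, 3, 5]
```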

Finally, merge the result back into the original DataFrame:

df.merge(tmp2, left_on=['company', 'date'], right_index=True)

If you prefer method chaining:

result = (
    df.groupby(['company', 'date'])
        .size()
        .groupby('company', group_keys=False)
        .apply(lambda s: s.cumsum() - s)
        .to_frame('totalpaidinvoices')
        .merge(df, how='right', left_index=True, right_on=['company', 'date'])
)
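For reference, the same idea can be run end to end against the question's sample data. This is only a sketch: it swaps the `apply` call for `groupby(...).cumsum()` and uses `DataFrame.join` in place of `merge`, and the names `per_day` and `paid_before` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['A', 1234, 20120201],
        ['A', 1134, 20120201],
        ['A', 1011, 20120201],
        ['A', 1123, 20121004],
        ['A', 1111, 20121004],
        ['A', 1224, 20121105],
        ['B', 1156, 20120403],
        ['B', 2345, 20120504],
        ['B', 4567, 20120504],
        ['B', 8796, 20120606],
    ],
    columns=['company', 'invoice', 'date'],
)

# Invoices per (company, date); groupby sorts the keys, so dates are in order.
per_day = df.groupby(['company', 'date']).size()

# Running total within each company, excluding the current date.
paid_before = (
    per_day.groupby(level='company').cumsum() - per_day
).rename('totalpaidinvoices')

# Attach the per-date value to every invoice row.
result = df.join(paid_before, on=['company', 'date'])
print(result['totalpaidinvoices'].tolist())
# [0, 0, 0, 3, 3, 5, 0, 1, 1, 3]
```

Unlike the expected output in the question, this variant keeps the column as integers, because no missing values are introduced along the way.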

Answer 2

If your data is sorted, you can try:

df = df.merge(
    df.groupby(["company", "date"])
    .size()
    .groupby(level=0, group_keys=False)
    .apply(lambda x: x.shift(1).fillna(0).cumsum())
    .reset_index(),
    on=["date", "company"],
).rename(columns={0: "totalpaidinvoices"})
print(df)

Prints:

  company  invoice      date  totalpaidinvoices
0       A     1234  20120201                0.0
1       A     1134  20120201                0.0
2       A     1011  20120201                0.0
3       A     1123  20121004                3.0
4       A     1111  20121004                3.0
5       A     1224  20121105                5.0
6       B     1156  20120403                0.0
7       B     2345  20120504                1.0
8       B     4567  20120504                1.0
9       B     8796  20120606                3.0
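The values print as floats because `shift(1)` introduces a missing value at the first date, which forces a float dtype before `fillna(0)` runs. A small sketch on company B's per-date counts shows this; the `counts` series and variable names are assumptions for illustration.

```python
import pandas as pd

# Per-date invoice counts for one company, already in date order.
counts = pd.Series([1, 2, 1], index=[20120403, 20120504, 20120606])

# Shift moves each count to the next date, so the running total
# only includes earlier dates; the first slot becomes NaN, then 0.
paid_before = counts.shift(1).fillna(0).cumsum()
print(paid_before.tolist())  # [0.0, 1.0, 3.0]

# Cast back to integers if whole numbers are preferred.
print(paid_before.astype(int).tolist())  # [0, 1, 3]
```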

Answer 3

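I thought that switching from cumcount to boolean indexing was overcomplicating things, but judging by the other answers, this actually seems to be the most concise (and possibly efficient) solution: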

for company in df.company.unique():
    df.loc[df.company==company, 'total_paid_invoices'] = df.date.apply(
        lambda x: df.loc[(df.date<x)&(df.company==company)].shape[0]
    )
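What the lambda computes for one row can be sketched in isolation. The frame below is a reduced subset of the question's data, chosen only for illustration, and `mask` and `paid_before` are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['A', 1234, 20120201],
        ['A', 1123, 20121004],
        ['A', 1111, 20121004],
        ['A', 1224, 20121105],
        ['B', 1156, 20120403],
    ],
    columns=['company', 'invoice', 'date'],
)

# Rows of company A dated strictly before 20121105.
mask = (df['company'] == 'A') & (df['date'] < 20121105)
paid_before = int(mask.sum())
print(paid_before)  # 3
```

The loop repeats this count for every row of every company, so the work grows roughly quadratically with the number of rows; that is unlikely to matter on small frames.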

3 solutions

Solution 1
4 2021-04-03 16:04:52

Solution 2
2 2021-04-03 16:05:48

Solution 3
2 2021-04-03 16:31:07