Cumulative count at a group level Python

I have a pandas dataframe like this:

df = pd.DataFrame([
        ['A', 1234, 20120201],
        ['A', 1134, 20120201],
        ['A', 1011, 20120201],
        ['A', 1123, 20121004],
        ['A', 1111, 20121004],
        ['A', 1224, 20121105],
        ['B', 1156, 20120403],
        ['B', 2345, 20120504],
        ['B', 4567, 20120504],
        ['B', 8796, 20120606]
    ], columns = ['company', 'invoice', 'date'])

The aim is to create a new column called 'totalpaidinvoices' that counts, for each record, the number of invoices the same company paid before that record's date.

I tried the following:

df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')  # parse YYYYMMDD integers as dates
df = df.sort_values(['company', 'date'], ascending=[True, True]).reset_index(drop=True)
df['totalpaidinvoices']= df[(df['date'] != df['date'].shift(1))].groupby(['company']).cumcount()
df['totalpaidinvoices']= df.groupby('company')['totalpaidinvoices'].ffill()

But instead of the number of invoices, what I get is the number of company-date combinations prior to the current record:

df = pd.DataFrame([
        ['A', 1234, 20120201, 0.0],
        ['A', 1134, 20120201, 0.0],
        ['A', 1011, 20120201, 0.0],
        ['A', 1123, 20121004, 1.0],
        ['A', 1111, 20121004, 1.0],
        ['A', 1224, 20121105, 2.0],
        ['B', 1156, 20120403, 0.0],
        ['B', 2345, 20120504, 1.0],
        ['B', 4567, 20120504, 1.0],
        ['B', 8796, 20120606, 2.0]
    ], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])

Expected output:

df = pd.DataFrame([
        ['A', 1234, 20120201, 0.0],
        ['A', 1134, 20120201, 0.0],
        ['A', 1011, 20120201, 0.0],
        ['A', 1123, 20121004, 3.0],
        ['A', 1111, 20121004, 3.0],
        ['A', 1224, 20121105, 5.0],
        ['B', 1156, 20120403, 0.0],
        ['B', 2345, 20120504, 1.0],
        ['B', 4567, 20120504, 1.0],
        ['B', 8796, 20120606, 3.0]
    ], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])

Any suggestions on how to fix this?

First, let's count the number of invoices paid on each day for each company:

tmp1 = df.groupby(['company', 'date']).size().rename('totalpaidinvoices')

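Then, for each company, we need to count how many invoices were paid before the current date. That's a job for cumsum: the running total minus the current day's own count leaves only the earlier invoices.

# group_keys=False keeps the original (company, date) index on recent pandas versions
tmp2 = tmp1.groupby('company', group_keys=False).apply(lambda s: s.cumsum() - s)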

And finally, merge the calculation with the original dataframe:

df.merge(tmp2, left_on=['company', 'date'], right_index=True)
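Putting the three steps together, a minimal end-to-end sketch might look like the following. The helper name `add_prior_invoice_count` is hypothetical, and the sketch calls `groupby(...).cumsum()` directly in place of `apply`, which should give the same numbers.

```python
import pandas as pd

df = pd.DataFrame([
    ['A', 1234, 20120201],
    ['A', 1134, 20120201],
    ['A', 1011, 20120201],
    ['A', 1123, 20121004],
    ['A', 1111, 20121004],
    ['A', 1224, 20121105],
    ['B', 1156, 20120403],
    ['B', 2345, 20120504],
    ['B', 4567, 20120504],
    ['B', 8796, 20120606],
], columns=['company', 'invoice', 'date'])


def add_prior_invoice_count(frame):
    """Hypothetical helper: per company, count invoices dated strictly before each row."""
    # Step 1: number of invoices per (company, date).
    per_day = frame.groupby(['company', 'date']).size().rename('totalpaidinvoices')
    # Step 2: running total within each company, minus the current day's own invoices.
    prior = per_day.groupby(level='company').cumsum() - per_day
    # Step 3: attach the per-day figure to every invoice row.
    return frame.merge(prior, left_on=['company', 'date'], right_index=True)


result = add_prior_invoice_count(df)
print(result)
```

Because the counts are computed per (company, date) pair before merging, every invoice on the same day receives the same value.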

If you prefer method chaining:

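result = (
    df.groupby(['company', 'date']).size().rename('totalpaidinvoices')
        .groupby('company', group_keys=False).apply(lambda s: s.cumsum() - s)
        .to_frame()  # a Series has no merge method, so convert to a DataFrame first
        .merge(df, how='right', left_index=True, right_on=['company', 'date'])
)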

If your data is sorted, you can try:

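df = df.merge(
    df.groupby(["company", "date"]).size()  # invoices per company and date
    .groupby(level=0, group_keys=False)
    .apply(lambda x: x.shift(1).fillna(0).cumsum())  # sum of all earlier days
    .reset_index(),  # the unnamed result becomes column 0
    on=["date", "company"],
).rename(columns={0: "totalpaidinvoices"})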


  company  invoice      date  totalpaidinvoices
0       A     1234  20120201                0.0
1       A     1134  20120201                0.0
2       A     1011  20120201                0.0
3       A     1123  20121004                3.0
4       A     1111  20121004                3.0
5       A     1224  20121105                5.0
6       B     1156  20120403                0.0
7       B     2345  20120504                1.0
8       B     4567  20120504                1.0
9       B     8796  20120606                3.0
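The shift-then-cumsum idiom relies on the per-day counts being in date order within each company, since `shift(1)` simply looks at the previous row. As a small illustration (using company A's daily counts from the question, typed in by hand), it produces the same totals as subtracting each day's count from the running sum; the only difference is that `fillna(0)` after `shift` leaves floats, which is why the column above shows `0.0`, `3.0`, and so on.

```python
import pandas as pd

# Invoices per day for company A, in date order (values taken from the question's data).
per_day = pd.Series([3, 2, 1], index=[20120201, 20121004, 20121105])

# Move each count down one row, treat the first day as 0, then accumulate.
shifted = per_day.shift(1).fillna(0).cumsum()

# Running total minus the current day's own count.
subtracted = per_day.cumsum() - per_day

print(shifted.tolist())
print(subtracted.tolist())
```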

I thought I was overcomplicating things by switching from cumcount to boolean indexing, but compared with the other answers this seems to be the most concise solution. Note that it rescans the dataframe for every row, so it may be slow on large data:

for company in df.company.unique():
    # For each row, count this company's invoices dated strictly earlier
    df.loc[df.company==company, 'total_paid_invoices'] = df.date.apply(
        lambda x: df.loc[(df.date<x)&(df.company==company)].shape[0]
    )
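For comparison, the same counts can also be obtained without an explicit loop. The sketch below is not taken from any of the answers; it swaps in a ranking trick, on the assumption that the `date` column has no missing values: within each company, `rank(method='min')` is one plus the number of rows with a strictly smaller date.

```python
import pandas as pd

df = pd.DataFrame([
    ['A', 1234, 20120201],
    ['A', 1134, 20120201],
    ['A', 1011, 20120201],
    ['A', 1123, 20121004],
    ['A', 1111, 20121004],
    ['A', 1224, 20121105],
    ['B', 1156, 20120403],
    ['B', 2345, 20120504],
    ['B', 4567, 20120504],
    ['B', 8796, 20120606],
], columns=['company', 'invoice', 'date'])

# Tied dates share the lowest rank in their group, so rank - 1 is the number of
# invoices of the same company with a strictly earlier date.
# Assumption: no NaN dates, otherwise astype(int) would fail.
df['totalpaidinvoices'] = (
    df.groupby('company')['date'].rank(method='min').sub(1).astype(int)
)
print(df)
```

This version does not require the rows to be sorted, since ranking only compares date values within each company.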
