I have a pandas dataframe like this:
import pandas as pd

df = pd.DataFrame([
['A', 1234, 20120201],
['A', 1134, 20120201],
['A', 1011, 20120201],
['A', 1123, 20121004],
['A', 1111, 20121004],
['A', 1224, 20121105],
['B', 1156, 20120403],
['B', 2345, 20120504],
['B', 4567, 20120504],
['B', 8796, 20120606]
], columns = ['company', 'invoice', 'date'])
The aim is to create a new column called 'totalpaidinvoices' that counts, for each record, the number of invoices the same company paid before that record's date.
I tried the following:
# The dates are stored as integers like 20120201, so parse them with an explicit format
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')
df = df.sort_values(['company', 'date'], ascending=[True, True]).reset_index(drop=True)
df['totalpaidinvoices'] = df[(df['date'] != df['date'].shift(1))].groupby(['company']).cumcount()
df['totalpaidinvoices'] = df.groupby('company')['totalpaidinvoices'].ffill()
But instead of the number of invoices, what I get is the number of distinct company-date combinations before the current record.
Output:
df = pd.DataFrame(
[
['A', 1234, 20120201, 0.0],
['A', 1134, 20120201, 0.0],
['A', 1011, 20120201, 0.0],
['A', 1123, 20121004, 1.0],
['A', 1111, 20121004, 1.0],
['A', 1224, 20121105, 2.0],
['B', 1156, 20120403, 0.0],
['B', 2345, 20120504, 1.0],
['B', 4567, 20120504, 1.0],
['B', 8796, 20120606, 2.0]
], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])
Expected output:
df = pd.DataFrame(
[
['A', 1234, 20120201, 0.0],
['A', 1134, 20120201, 0.0],
['A', 1011, 20120201, 0.0],
['A', 1123, 20121004, 3.0],
['A', 1111, 20121004, 3.0],
['A', 1224, 20121105, 5.0],
['B', 1156, 20120403, 0.0],
['B', 2345, 20120504, 1.0],
['B', 4567, 20120504, 1.0],
['B', 8796, 20120606, 3.0]
], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])
Any suggestions on how to fix this?
First, let's count the number of invoices paid on each day for each company:
tmp1 = df.groupby(['company', 'date']).size().rename('totalpaidinvoices')
Then, for each company, we need to count how many invoices were paid before the current date. That's a job for cumsum: take the running total within each company and subtract the current date's own count.
tmp2 = tmp1.groupby('company').cumsum() - tmp1
And finally, merge the calculation with the original dataframe:
df.merge(tmp2, left_on=['company', 'date'], right_index=True)
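For reference, here is a minimal end-to-end sketch of the three steps above, run on the question's sample data. The names `per_day`, `prior` and `result` are only illustrative, and the dates are left as integers (they still sort correctly in `YYYYMMDD` form).

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["A", 1234, 20120201],
        ["A", 1134, 20120201],
        ["A", 1011, 20120201],
        ["A", 1123, 20121004],
        ["A", 1111, 20121004],
        ["A", 1224, 20121105],
        ["B", 1156, 20120403],
        ["B", 2345, 20120504],
        ["B", 4567, 20120504],
        ["B", 8796, 20120606],
    ],
    columns=["company", "invoice", "date"],
)

# Step 1: number of invoices per (company, date).
per_day = df.groupby(["company", "date"]).size().rename("totalpaidinvoices")

# Step 2: running total within each company, minus the current date's own
# count, which leaves only the invoices from strictly earlier dates.
prior = per_day.groupby(level="company").cumsum() - per_day

# Step 3: attach the per-(company, date) value to every invoice row.
result = df.merge(prior, left_on=["company", "date"], right_index=True)
print(result)
```

On the sample data, the new column should match the expected output in the question (0, 0, 0, 3, 3, 5 for company A and 0, 1, 1, 3 for company B).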
If you prefer method chaining:
result = (
    df.groupby(['company', 'date'])
    .size()
    # group_keys=False keeps the original (company, date) index in pandas 2.x
    .groupby('company', group_keys=False)
    .apply(lambda s: s.cumsum() - s)
    .to_frame('totalpaidinvoices')
    .merge(df, how='right', left_index=True, right_on=['company', 'date'])
)
If your data is sorted, you can try:
df = df.merge(
    df.groupby(["company", "date"])
    .size()
    # group_keys=False avoids a duplicated "company" index level in pandas 2.x
    .groupby(level=0, group_keys=False)
    .apply(lambda x: x.shift(1).fillna(0).cumsum())
    .reset_index(),
    on=["date", "company"],
).rename(columns={0: "totalpaidinvoices"})
print(df)
Prints:
  company  invoice      date  totalpaidinvoices
0       A     1234  20120201                0.0
1       A     1134  20120201                0.0
2       A     1011  20120201                0.0
3       A     1123  20121004                3.0
4       A     1111  20121004                3.0
5       A     1224  20121105                5.0
6       B     1156  20120403                0.0
7       B     2345  20120504                1.0
8       B     4567  20120504                1.0
9       B     8796  20120606                3.0
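As a side note, shifting and then taking the cumulative sum computes the same quantity as the cumulative sum minus the current count. The small sketch below illustrates this on a hypothetical per-date count series for a single company (the values mirror company A), and also shows why the shifted version comes out as floats: `shift` introduces a NaN before `fillna(0)` replaces it.

```python
import pandas as pd

# Hypothetical per-date invoice counts for one company, already in date order.
counts = pd.Series([3, 2, 1], index=[20120201, 20121004, 20121105])

# Variant 1: shift by one date, then accumulate (float, because shift adds NaN).
shifted = counts.shift(1).fillna(0).cumsum()

# Variant 2: accumulate, then remove the current date's own count (stays integer).
subtracted = counts.cumsum() - counts

print(shifted.tolist())
print(subtracted.tolist())
```

Both variants rely on the counts being in date order within each company, which `groupby(...).size()` provides by default because it sorts the group keys.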
I thought I was making it too complicated by switching from cumcount to boolean indexing, but based on the other answers, it seems this is actually the most concise (and potentially efficient) solution:
for company in df.company.unique():
    # Rows belonging to the current company
    mask = df.company == company
    # For each of those rows, count the company's invoices with an earlier date
    df.loc[mask, 'totalpaidinvoices'] = df.loc[mask, 'date'].apply(
        lambda x: (mask & (df.date < x)).sum()
    )
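If the per-row comparison becomes slow on larger data, one possible alternative (not taken from the answers above, just a sketch) is to use a grouped rank. With `method="min"`, each row's rank is 1 plus the number of rows in its company with a strictly earlier date, so subtracting 1 gives the count of earlier invoices.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["A", 1234, 20120201],
        ["A", 1134, 20120201],
        ["A", 1011, 20120201],
        ["A", 1123, 20121004],
        ["A", 1111, 20121004],
        ["A", 1224, 20121105],
        ["B", 1156, 20120403],
        ["B", 2345, 20120504],
        ["B", 4567, 20120504],
        ["B", 8796, 20120606],
    ],
    columns=["company", "invoice", "date"],
)

# Tied dates share the lowest rank in their group, so rank - 1 counts only
# the rows with a strictly smaller date within the same company.
df["totalpaidinvoices"] = (
    df.groupby("company")["date"].rank(method="min").sub(1).astype(int)
)
print(df)
```

This does not require the frame to be sorted, and it assumes the date column has no missing values (otherwise the `astype(int)` step would fail).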