Pandas: Fill every row by its first and last occurrence
My data includes invoices and customers. One customer can have multiple invoices, and each invoice belongs to exactly one customer. The invoices are updated daily (Report Date).
My goal is to calculate the age of each customer in days (see the column "Age in Days"). To do this, I take the first occurrence of a customer's report date and calculate the difference to the last occurrence of the report date.
E.g. customer 1 occurs from 08-14 to 08-15, so he/she is 1 day old.
Report Date Invoice No Customer No Amount Age in Days
2018-08-14 A 1 50$ 1
2018-08-14 B 1 100$ 1
2018-08-14 C 2 75$ 2
2018-08-15 A 1 20$ 1
2018-08-15 B 1 45$ 1
2018-08-15 C 2 70$ 2
2018-08-16 C 2 40$ 2
2018-08-16 D 3 100$ 0
2018-08-16 E 3 60$ 0
I solved this, but very inefficiently, and it takes too long: my data contains 26 million rows. Below I calculate the age for one customer only.
# List every customer no
customerNo = df["Customer No"].unique()
customer_age = []
# Testing for one specific customer
testCustomer = df.loc[df["Customer No"] == customerNo[0]]
testCustomer = testCustomer.sort_values(by="Report Date", ascending=True)
first_occur = testCustomer.iloc[0]['Report Date']
last_occur = testCustomer.iloc[-1]['Report Date']
age = (last_occur - first_occur).days
customer_age.extend([age] * len(testCustomer))
testCustomer.loc[:,'Customer Age']=customer_age
Is there a better way to solve this problem?
If you need one value per customer indicating its age, you can use a groupby (very common):
# Earliest and latest report date per customer
grpd = my_df.groupby('Customer No')['Report Date'].agg(['min', 'max']).reset_index()
grpd['days_diff'] = (grpd['max'] - grpd['min']).dt.days
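To write that per-customer value back onto every invoice row, as the question asks, the aggregated result can be mapped back by customer number. The following is only a minimal sketch: it assumes `Report Date` is already a datetime column, and the small sample frame and the name `age_by_customer` are illustrative, not part of the original post.

```python
import pandas as pd

# Small sample mirroring the question's table (illustrative data only).
my_df = pd.DataFrame({
    "Report Date": pd.to_datetime([
        "2018-08-14", "2018-08-14", "2018-08-14",
        "2018-08-15", "2018-08-15", "2018-08-15",
        "2018-08-16", "2018-08-16", "2018-08-16",
    ]),
    "Invoice No": ["A", "B", "C", "A", "B", "C", "C", "D", "E"],
    "Customer No": [1, 1, 2, 1, 1, 2, 2, 3, 3],
})

# One row per customer with the earliest and latest report date.
# The index is left as "Customer No" so it can be used as a lookup key.
grpd = my_df.groupby("Customer No")["Report Date"].agg(["min", "max"])

# Series indexed by customer number holding the age in days.
age_by_customer = (grpd["max"] - grpd["min"]).dt.days

# Broadcast the per-customer age back to every invoice row.
my_df["Age in Days"] = my_df["Customer No"].map(age_by_customer)
```

Keeping `Customer No` as the index (no `reset_index()`) is what lets `Series.map` look up each row's customer directly.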
Use groupby.transform with the first and last aggregations:
grps = df.groupby('Customer No')['Report Date']
df['Age in Days'] = (grps.transform('last') - grps.transform('first')).dt.days
[out]
Report Date Invoice No Customer No Amount Age in Days
0 2018-08-14 A 1 50$ 1
1 2018-08-14 B 1 100$ 1
2 2018-08-14 C 2 75$ 2
3 2018-08-15 A 1 20$ 1
4 2018-08-15 B 1 45$ 1
5 2018-08-15 C 2 70$ 2
6 2018-08-16 C 2 40$ 2
7 2018-08-16 D 3 100$ 0
8 2018-08-16 E 3 60$ 0
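One caveat worth noting: `first` and `last` pick values by row position within each group, so the result above relies on the frame being sorted by Report Date. If that ordering is not guaranteed, `min` and `max` should give the same ages without sorting. A small sketch on deliberately unsorted, made-up rows (the name `positional` is only for illustration):

```python
import pandas as pd

# Deliberately unsorted sample rows (illustrative data only).
df = pd.DataFrame({
    "Report Date": pd.to_datetime([
        "2018-08-16", "2018-08-14", "2018-08-15", "2018-08-14", "2018-08-16",
    ]),
    "Customer No": [2, 2, 1, 1, 3],
})

grps = df.groupby("Customer No")["Report Date"]

# Order-independent: latest minus earliest date per customer,
# broadcast back to every row by transform.
df["Age in Days"] = (grps.transform("max") - grps.transform("min")).dt.days

# For comparison: the positional variant on the same unsorted rows,
# which can go negative when a later date appears before an earlier one.
positional = (grps.transform("last") - grps.transform("first")).dt.days
```

On data that is already sorted by date, both variants agree, so the choice mainly matters when the row order cannot be trusted.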