Pandas: customer age from its first and last occurrence, on each row

Question

My data includes invoices and customers. One customer can have multiple invoices. Each invoice always belongs to exactly one customer. The invoices are updated daily (Report Date).

My goal is to calculate the age of each customer in days (see the "Age in Days" column). To do this, I take the first occurrence of a customer's report date and calculate the difference to the last occurrence of the report date.

E.g. customer 1 occurs from 08-14 to 08-15, so he/she is 1 day old.

Report Date  Invoice No   Customer No  Amount  Age in Days
2018-08-14   A            1            50$     1
2018-08-14   B            1            100$    1
2018-08-14   C            2            75$     2

2018-08-15   A            1            20$     1
2018-08-15   B            1            45$     1
2018-08-15   C            2            70$     2

2018-08-16   C            2            40$     2
2018-08-16   D            3            100$    0
2018-08-16   E            3            60$     0

I solved this, but very inefficiently, and it takes too long: my data contains 26 million rows. Below I calculate the age for one customer only.

# List every customer no
customerNo = df["Customer No"].unique()
customer_age = []

# Testing for one specific customer
testCustomer = df.loc[df["Customer No"] == customerNo[0]]
testCustomer = testCustomer.sort_values(by="Report Date", ascending=True)

first_occur = testCustomer.iloc[0]['Report Date']
last_occur = testCustomer.iloc[-1]['Report Date']
age = (last_occur - first_occur).days

customer_age.extend([age] * len(testCustomer))
testCustomer.loc[:,'Customer Age']=customer_age

Is there a better way to solve this problem?

Answer 1

If you need one value per customer indicating its age, you can use a groupby (very common):

# Earliest and latest report date per customer (string names avoid passing the builtins min/max)
grpd = my_df.groupby('Customer No')['Report Date'].agg(['min', 'max']).reset_index()
grpd['days_diff'] = (grpd['max'] - grpd['min']).dt.days
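
To get this per-customer value back onto every invoice row, as the "Age in Days" column in the question requires, one option is to map it through the customer number. This is a minimal sketch, assuming "Report Date" is already a datetime column; the small frame and the names `span` and `ages` are made up for illustration:

```python
import pandas as pd

# Illustrative sample data (not the asker's real data).
df = pd.DataFrame({
    "Report Date": pd.to_datetime(
        ["2018-08-14", "2018-08-14", "2018-08-15", "2018-08-16", "2018-08-16"]
    ),
    "Invoice No": ["A", "C", "A", "C", "D"],
    "Customer No": [1, 2, 1, 2, 3],
})

# One row per customer: earliest and latest report date.
span = df.groupby("Customer No")["Report Date"].agg(["min", "max"])

# Series indexed by customer number holding the age in whole days.
ages = (span["max"] - span["min"]).dt.days

# Broadcast the per-customer age back to every invoice row.
df["Age in Days"] = df["Customer No"].map(ages)
```

The aggregation produces one row per customer, so the `map` step is a cheap lookup rather than a per-customer loop.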

Answer 2

Use groupby.transform with the first and last aggregations:

grps = df.groupby('Customer No')['Report Date']
df['Age in Days'] = (grps.transform('last') - grps.transform('first')).dt.days

[out]

  Report Date Invoice No  Customer No Amount  Age in Days
0  2018-08-14          A            1    50$            1
1  2018-08-14          B            1   100$            1
2  2018-08-14          C            2    75$            2
3  2018-08-15          A            1    20$            1
4  2018-08-15          B            1    45$            1
5  2018-08-15          C            2    70$            2
6  2018-08-16          C            2    40$            2
7  2018-08-16          D            3   100$            0
8  2018-08-16          E            3    60$            0
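
Note that `first` and `last` take the first and last rows of each group in their current order, so the result above relies on the frame being sorted by report date. If that order is not guaranteed, `min` and `max` can be used instead. This is a sketch with made-up, deliberately shuffled sample data; the `first_last` column is only there for comparison:

```python
import pandas as pd

# Illustrative sample data, intentionally not sorted by date.
df = pd.DataFrame({
    "Report Date": pd.to_datetime(
        ["2018-08-16", "2018-08-14", "2018-08-15", "2018-08-14", "2018-08-16"]
    ),
    "Customer No": [2, 1, 1, 2, 3],
})

dates = df.groupby("Customer No")["Report Date"]

# Order-independent: latest minus earliest date per customer, on every row.
df["Age in Days"] = (dates.transform("max") - dates.transform("min")).dt.days

# Order-dependent: goes negative for customer 2 because the rows are shuffled.
df["first_last"] = (dates.transform("last") - dates.transform("first")).dt.days
```

Sorting once with `df.sort_values("Report Date")` before grouping is the other way to make the `first`/`last` version safe.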

Answer 1
2 2019-07-30 12:05:18

Answer 2
2 Accepted 2019-07-30 12:06:47
