Using Pandas .diff() on a time series column with groupby

Question

I have a CSV file of customer purchases, in no particular order, that I read into a Pandas DataFrame. I'd like to add a column that shows, for each purchase, how much time has passed since the previous purchase by the same customer. The code below does produce differences, but I'm not sure where they come from: they are much too large, even if they are in seconds.

CSV:

Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015

Python:

import pandas as pd
import time
start = time.time()
data = pd.read_csv('data.csv', low_memory=False)
data = data.sort_values(by=['Customer Id', 'Purchase Date'])
data['Purchase Date'] = pd.to_datetime(data['Purchase Date'])
data['Purchase Difference'] = (data.groupby(['Customer Id'])['Purchase Date']
                         .diff()
                         .fillna('-')
                       )
print data

Output:

    Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                   -
5         2322    2015-02-01    2678400000000000
4         2322    2015-03-01    2419200000000000
0         4543    2015-01-01                   -
1         4543    2015-02-05    3024000000000000
2         4543    2015-03-15    3283200000000000
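
The large integers look like the time differences expressed in nanoseconds, presumably because .fillna('-') pushes the column away from the timedelta dtype. As a rough check, and assuming the nanosecond interpretation is right, the first three values from the output above can be converted back to days:

```python
import pandas as pd

# First three values copied from the output above; treating them as
# nanosecond counts is an assumption, not something the question states.
raw = [2678400000000000, 2419200000000000, 3024000000000000]

# pd.to_timedelta interprets the integers using the given unit.
deltas = pd.to_timedelta(raw, unit='ns')

# Each element is a Timedelta; .days gives its whole-day component.
days = [d.days for d in deltas]
print(days)
```

The results line up with the first three day counts in the desired output, which suggests the differences themselves are correct and only their representation is off.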

Desired Output:

   Customer Id Purchase Date  Purchase Difference
3         2322    2015-01-01                  -
5         2322    2015-02-01              31 days
4         2322    2015-03-01              28 days
0         4543    2015-01-01                  -
1         4543    2015-02-05              35 days
2         4543    2015-03-15              38 days
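
One more thing worth checking in the script above: it sorts before converting 'Purchase Date' to datetimes, so the sort compares strings. That happens to work for the sample CSV, but it may not in general. A small sketch with hypothetical dates (not taken from the CSV) shows the difference:

```python
import pandas as pd

# Hypothetical dates chosen to expose the issue; not from the question's CSV.
dates = pd.Series(['2/1/2015', '10/1/2015', '1/1/2015'])

# Sorting the raw strings is lexicographic, so October lands before February.
as_text = dates.sort_values().tolist()

# Converting first gives a chronological sort.
parsed = pd.to_datetime(dates, format='%m/%d/%Y')
as_dates = parsed.sort_values().dt.strftime('%m/%d/%Y').tolist()

print(as_text)
print(as_dates)
```

Converting with pd.to_datetime (or parse_dates in read_csv) before calling sort_values avoids the problem.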

Answer 1

I think you can pass the parse_dates parameter to read_csv to parse the dates, then call sort_values, and finally use groupby with diff:

import pandas as pd
import io

temp=u"""Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015"""
# after testing, replace io.StringIO(temp) with the filename
data = pd.read_csv(io.StringIO(temp), parse_dates=['Purchase Date'])

data.sort_values(by=['Customer Id', 'Purchase Date'], inplace=True)

data['Purchase Difference'] = data.groupby(['Customer Id'])['Purchase Date'].diff()
print(data)
   Customer Id Purchase Date  Purchase Difference
3         2322    2015-01-01                  NaT
5         2322    2015-02-01              31 days
4         2322    2015-03-01              28 days
0         4543    2015-01-01                  NaT
1         4543    2015-02-05              35 days
2         4543    2015-03-15              38 days
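
If a plain number of days is more convenient than a timedelta, the .dt.days accessor can be applied to the result of diff. The sketch below reuses the sample data; the column name 'Days Since Last' is only an illustrative choice:

```python
import io
import pandas as pd

csv_text = u"""Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015"""

data = pd.read_csv(io.StringIO(csv_text), parse_dates=['Purchase Date'])
data = data.sort_values(by=['Customer Id', 'Purchase Date'])

# Per-customer gap between consecutive purchases (NaT for the first one).
diff = data.groupby('Customer Id')['Purchase Date'].diff()

# Whole days as numbers; the first purchase of each customer becomes NaN.
# 'Days Since Last' is a hypothetical column name used for illustration.
data['Days Since Last'] = diff.dt.days
print(data)
```

Keeping the column numeric (with NaN for the first purchase) rather than filling it with '-' means it can still be used in later calculations.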

Answer 2

You can simply apply diff to the Purchase Date column once it has been converted to Timestamp values.

df['Purchase Date'] = pd.to_datetime(df['Purchase Date'])
df.sort_values(['Customer Id', 'Purchase Date'], inplace=True)
# Format each per-customer gap as 'N days'; missing gaps (NaT) and gaps of
# one day or less are rendered as an empty string.
df['Purchase Difference'] = [
    str(n.days) + ' days' if pd.notnull(n) and n > pd.Timedelta(days=1) else ''
    for n in df.groupby('Customer Id', sort=False)['Purchase Date'].diff()]

>>> df
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                    
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01                    
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
6         4543    2015-03-15
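
The formatting step may be easier to read as a small function. This is only a sketch: format_gap is a hypothetical helper, and showing '-' for a missing gap follows the desired output in the question rather than this answer.

```python
import pandas as pd

def format_gap(delta):
    """Hypothetical helper: render a Timedelta as 'N day(s)', or '-' if missing."""
    if pd.isnull(delta):
        return '-'
    return '{} day{}'.format(delta.days, '' if delta.days == 1 else 's')

# Sample data from the question, built inline so the sketch is self-contained.
df = pd.DataFrame({
    'Customer Id': [4543, 4543, 4543, 2322, 2322, 2322],
    'Purchase Date': pd.to_datetime(
        ['1/1/2015', '2/5/2015', '3/15/2015', '1/1/2015', '3/1/2015', '2/1/2015'],
        format='%m/%d/%Y'),
})
df = df.sort_values(['Customer Id', 'Purchase Date'])

# Iterating over the diff Series yields Timedelta objects (or NaT).
gaps = df.groupby('Customer Id', sort=False)['Purchase Date'].diff()
df['Purchase Difference'] = [format_gap(g) for g in gaps]
print(df)
```

Note that the resulting column holds strings, so it is suitable for display but not for further date arithmetic.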

Answer 1: score 4, 2016-05-04 17:12:52

Answer 2: score 4, accepted, 2016-05-04 17:13:56