Find the two most recent dates for each customer in Python using pandas

I have a pandas dataframe with the purchase dates of each customer. I want to find the most recent and the second most recent purchase date for each unique customer. Here is my dataframe:

name    date
ab1     6/1/18
ab1     6/2/18
ab1     6/3/18
ab1     6/4/18
ab2     6/8/18
ab2     6/9/18
ab3     6/23/18

I am expecting the following output:

name    second most recent date        most recent date
ab1      6/3/18                         6/4/18
ab2      6/8/18                         6/9/18
ab3      6/23/18                        6/23/18

I know data['date'].max() gives the most recent purchase date, but I have no idea how to find the second most recent one. Any help will be highly appreciated.

To get the two most recent purchase dates for each customer, first sort your dataframe by date in descending order, then group by name and spread each customer's dates across individual columns. Finally, keep only the first two of these columns, which hold the two most recent purchase dates for each customer.

Here's an example:

import pandas as pd

# set up data from your example
df = pd.DataFrame({
    "name": ["ab1", "ab1", "ab1", "ab1", "ab2", "ab2", "ab3"],
    "date": ["6/1/18", "6/2/18", "6/3/18", "6/4/18", "6/8/18", "6/9/18", "6/23/18"]
})

# create column of datetimes (for sorting reverse-chronologically)
df["datetime"] = pd.to_datetime(df.date)

# sort newest first, collect each customer's dates into a list,
# then expand each list into columns 0, 1, 2, ...
grouped_df = (
    df.sort_values("datetime", ascending=False)
    .groupby("name")["date"]
    .apply(list)
    .apply(pd.Series)
    .reset_index()
)

# keep only the two most recent dates and rename the columns
grouped_df = grouped_df[["name", 0, 1]]
grouped_df.columns = ["name", "most_recent", "second_most_recent"]

This leaves grouped_df looking like this:

  name most_recent second_most_recent
0  ab1      6/4/18             6/3/18
1  ab2      6/9/18             6/8/18
2  ab3     6/23/18                NaN
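
A side note on why the datetime column is worth creating: the raw date values are strings, and strings sort character by character rather than chronologically. Here is a minimal sketch of the difference, using made-up dates that are not from the question's data:

```python
import pandas as pd

# hypothetical dates chosen to expose the problem; not from the question
dates = pd.Series(["6/9/18", "6/10/18", "6/23/18"])

# sorting the raw strings compares characters, so "6/9/18" ranks as the "latest"
string_order = dates.sort_values(ascending=False).tolist()

# sorting the parsed datetimes gives true reverse-chronological order
parsed = pd.to_datetime(dates, format="%m/%d/%y")
datetime_order = dates.loc[parsed.sort_values(ascending=False).index].tolist()

print(string_order)    # ['6/9/18', '6/23/18', '6/10/18']
print(datetime_order)  # ['6/23/18', '6/10/18', '6/9/18']
```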

If you want to fill any missing second_most_recent values with the corresponding most_recent value, you can use np.where, like this:

import numpy as np

grouped_df["second_most_recent"] = np.where(
    grouped_df.second_most_recent.isna(),
    grouped_df.most_recent,
    grouped_df.second_most_recent
)

With result:

  name most_recent second_most_recent
0  ab1      6/4/18             6/3/18
1  ab2      6/9/18             6/8/18
2  ab3     6/23/18            6/23/18
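
As an alternative sketch (a different route from the list-based approach above), the same table can be built in one pass: sort chronologically, group by name, and take the last and second-to-last date in each group, falling back to the last date when a customer has only one purchase. The helper name second_latest and the output column names are assumptions chosen for illustration.

```python
import pandas as pd

# data from the question
df = pd.DataFrame({
    "name": ["ab1", "ab1", "ab1", "ab1", "ab2", "ab2", "ab3"],
    "date": ["6/1/18", "6/2/18", "6/3/18", "6/4/18", "6/8/18", "6/9/18", "6/23/18"],
})
df["datetime"] = pd.to_datetime(df["date"], format="%m/%d/%y")


def second_latest(dates):
    """Return the second-to-last value, or the last one if there is only one."""
    return dates.iloc[-2] if len(dates) > 1 else dates.iloc[-1]


# sort oldest first on the parsed datetimes, so the newest rows come last in each group
result = (
    df.sort_values("datetime")
    .groupby("name")["date"]
    .agg(second_most_recent=second_latest, most_recent="last")
    .reset_index()
)

print(result)
```

Because the fallback lives inside the helper, customers with a single purchase (ab3 here) get the same date in both columns without a separate fill step.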
