
Subtracting two columns within a Pandas GroupBy object

I have a dataset of marketing campaigns in which each house receives campaign actions such as "flyer" or "call". Each action has its own creation and end date. Some houses have only one action, and some have several.

What I want to do is:

I want to calculate the length of the campaign for each house, that is, the time between the first action (e.g. flyer) and the last recorded action for each house. If each house had only one action, I could easily solve this by subtracting the start date column from the end date column.

Because some houses have multiple actions, I figured I could group all the houses with the Pandas GroupBy function. Does anyone know how to subtract within a groupby object?

Data looks like this:

house1 flyer 01-12-2014 05-12-2014
house1 phonecall 06-12-2014 06-12-2014
house2 flyer 01-12-2014 31-12-2014

My expected output looks like this:

house1 ; 5 days
house2 ; 30 days
house3 ; 12 days
house4 ; 60 days
etc

Simply use the agg function on the groups:

# Earliest start and latest end per house
t = df.groupby("house").agg({"start": "min", "end": "max"})
# Campaign length as a Timedelta
t["duration"] = t["end"] - t["start"]

With the dates parsed day-first (dd-mm-yyyy, as in the question's sample), the result is:

            start        end duration
house                                
house1 2014-12-01 2014-12-06   5 days
house2 2014-12-01 2014-12-31  30 days
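
The subtraction can also be done directly on the grouped columns, without building an intermediate frame. The following is a minimal sketch, assuming the column names house, start, and end used in this answer and assuming the sample dates are day-first (dd-mm-yyyy); the names grouped, duration, and duration_days are illustrative only.

```python
import pandas as pd
from io import StringIO

# Sample rows from the question: house, medium, start, end (dd-mm-yyyy assumed).
data = """house1 flyer 01-12-2014 05-12-2014
house1 phonecall 06-12-2014 06-12-2014
house2 flyer 01-12-2014 31-12-2014"""

df = pd.read_csv(StringIO(data), sep=r"\s+", header=None,
                 names=["house", "medium", "start", "end"])

# Parse both date columns with an explicit day-first format.
for col in ("start", "end"):
    df[col] = pd.to_datetime(df[col], format="%d-%m-%Y")

# Latest end minus earliest start, aligned on the house index.
grouped = df.groupby("house")
duration = grouped["end"].max() - grouped["start"].min()

# Whole days as integers, printed in the format the question asks for.
duration_days = duration.dt.days
for house, days in duration_days.items():
    print(f"{house} ; {days} days")
```

Both sides of the subtraction are Series indexed by house, so pandas aligns them by label and returns one Timedelta per house.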

Edit - creating the dataframe

Per a question in one of the comments, here's how I created the dataframe:

import pandas as pd
from io import StringIO

data = """house1 flyer 01-12-2014 05-12-2014
house1 phonecall 06-12-2014 06-12-2014
house2 flyer 01-12-2014 31-12-2014"""

df = pd.read_csv(StringIO(data), sep=r"\s+",
                 header=None,
                 names=["house", "medium", "start", "end"])

# Make sure 'start' and 'end' are dates. The sample uses dd-mm-yyyy, so parse day-first.
df["end"] = pd.to_datetime(df["end"], dayfirst=True)
df["start"] = pd.to_datetime(df["start"], dayfirst=True)
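
One thing to watch when converting these columns: the sample dates are written day-first (dd-mm-yyyy), and pd.to_datetime does not assume that by default for an ambiguous string. A small sketch of the difference, assuming "01-12-2014" is meant as 1 December 2014 (the variable names are illustrative):

```python
import pandas as pd

raw = "01-12-2014"  # assumed to mean 1 December 2014

# Default parsing reads an ambiguous string month-first.
month_first = pd.to_datetime(raw)

# dayfirst=True, or an explicit format, reads it as dd-mm-yyyy.
day_first = pd.to_datetime(raw, dayfirst=True)
explicit = pd.to_datetime(raw, format="%d-%m-%Y")

print(month_first.date())  # 2014-01-12
print(day_first.date())    # 2014-12-01
print(explicit.date())     # 2014-12-01
```

Passing an explicit format is the stricter choice, since a value that does not match raises an error instead of being silently reinterpreted.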
