
Finding first event in one table that occurred after event in a second table, per row

I have a pandas DataFrame that looks like this:

email    signup_date    
a@a.com   7/21/16
b@b.com   6/6/16
d@d.com   5/5/16
b@b.com   4/4/16

I have a second pandas DataFrame with related events, one row for each time a signup actually got followed through on. It looks like this:

email    call_date    
a@a.com   7/25/16
b@b.com   6/20/16
b@b.com   5/4/16

There are a few things to keep in mind.

  • Some signup_date values may have no later call_date for the corresponding email. That is, some people sign up but never get a call back.
  • Some people sign up more than once and get called after each signup.
  • Some people sign up more than once but don't get called after each signup.
  • Some people sign up more than once but may only get called after several signups. In this case we want to attribute the call to the most recent signup.

Ultimately the goal is to determine whether there is a call event that came after a signup event for a given user but before that user's next signup event, where the number of signup events per user is not known in advance.

Is there a pandas best-practice way to do this? For now I'm using a for loop, and it's extremely slow (it hasn't finished on 100,000 rows even after 20 minutes):

response_date = []
for i in range(signups.shape[0]):
    unique_id = signups.unique_id.values[i]
    start_date = signups.signup_date.values[i]
    end_date = signups.signup_date.values[-1]
    if end_date == start_date:  # `is` compared object identity, not equality
        end_date = end_date + pd.Timedelta(days=365)  # Timedelta has no 'year' unit
    tmp_df = calls[calls.unique_id == unique_id]
    # combine the two conditions into one mask; chaining two boolean
    # selections forces pandas to reindex the second mask and warns
    tmp_df = tmp_df[(tmp_df.timestamp > start_date) & (tmp_df.timestamp < end_date)]
    tmp_df = tmp_df.sort_values('timestamp')
    if tmp_df.shape[0] > 0:
        response_date.append(tmp_df.timestamp.values[0])
    else:
        response_date.append(None)
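For reference, the window the loop builds per row (the first call after a signup but before that user's next signup) can be computed without a Python-level loop. This is only a sketch, not the accepted answer's method; it uses the example column names (`email`, `signup_date`, `call_date`) rather than the `unique_id`/`timestamp` names in the loop, and it keeps the loop's fallback of a one-year window after a user's final signup:

```python
import pandas as pd

# hypothetical stand-ins for the real `signups` and `calls` frames
signups = pd.DataFrame({
    'email': ['a@a.com', 'b@b.com', 'd@d.com', 'b@b.com'],
    'signup_date': pd.to_datetime(['2016-07-21', '2016-06-06',
                                   '2016-05-05', '2016-04-04']),
})
calls = pd.DataFrame({
    'email': ['a@a.com', 'b@b.com', 'b@b.com'],
    'call_date': pd.to_datetime(['2016-07-25', '2016-06-20', '2016-05-04']),
})

s = signups.sort_values(['email', 'signup_date']).copy()
# each signup's window ends at that user's next signup...
s['window_end'] = s.groupby('email')['signup_date'].shift(-1)
# ...or one year out for the user's final signup, mirroring the loop's fallback
s['window_end'] = s['window_end'].fillna(s['signup_date'] + pd.Timedelta(days=365))

m = s.merge(calls, on='email', how='left')
in_window = m['call_date'].ge(m['signup_date']) & m['call_date'].lt(m['window_end'])
m.loc[~in_window, 'call_date'] = pd.NaT

# earliest call inside each signup's window; NaT where nobody called back
result = m.groupby(['email', 'signup_date'], as_index=False)['call_date'].min()
```

`groupby('email').shift(-1)` supplies each signup's successor, so the window check becomes a pair of vectorized comparisons instead of a per-row slice.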

Thanks for any advice!

setup

from io import StringIO  # Python 3; the original Python 2 import was `from StringIO import StringIO`
import pandas as pd

txt1 = """email    signup_date    
a@a.com   7/21/16
b@b.com   6/6/16
d@d.com   5/5/16
b@b.com   4/4/16"""

df1 = pd.read_csv(StringIO(txt1), parse_dates=[1], sep=r"\s+")  # sep=r"\s+" replaces the deprecated delim_whitespace=True

txt2 = """email    call_date    
a@a.com   7/25/16
b@b.com   6/20/16
b@b.com   5/4/16"""

df2 = pd.read_csv(StringIO(txt2), parse_dates=[1], sep=r"\s+")
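One sanity check worth doing at this point (an addition, not part of the original answer): the date comparisons below only behave correctly if `parse_dates` actually produced `datetime64` columns; if parsing silently fails, the columns stay as strings and compare lexicographically.

```python
from io import StringIO
import pandas as pd

txt1 = """email signup_date
a@a.com 7/21/16
b@b.com 6/6/16"""

df1 = pd.read_csv(StringIO(txt1), parse_dates=[1], sep=r"\s+")
print(df1.dtypes)  # signup_date should show datetime64[ns], not object
```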

merge
combine df1 and df2 on email

df = df1.merge(df2)
df

     email signup_date  call_date
0  a@a.com  2016-07-21 2016-07-25
1  b@b.com  2016-06-06 2016-06-20
2  b@b.com  2016-06-06 2016-05-04
3  b@b.com  2016-04-04 2016-06-20
4  b@b.com  2016-04-04 2016-05-04

filter 1
get rid of rows where call_date is prior to signup_date

cond1 = df.signup_date.le(df.call_date)
cond1

0     True
1     True
2    False
3     True
4     True
dtype: bool

df = df[cond1]
df

     email signup_date  call_date
0  a@a.com  2016-07-21 2016-07-25
1  b@b.com  2016-06-06 2016-06-20
3  b@b.com  2016-04-04 2016-06-20
4  b@b.com  2016-04-04 2016-05-04

filter 2
groupby ['call_date', 'email'] and get the most recent signup with idxmax

most_recent = df.groupby(['call_date', 'email']).signup_date.idxmax()
most_recent

call_date   email  
2016-05-04  b@b.com    4
2016-06-20  b@b.com    1
2016-07-25  a@a.com    0
Name: signup_date, dtype: int64
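A reminder of what makes this work: `idxmax` returns the index *label* of the maximum, not its position, so the integers above are row labels in `df` and can be fed straight back into label-based indexing:

```python
import pandas as pd

s = pd.Series([10, 30, 20], index=[7, 8, 9])
print(s.idxmax())  # prints 8, the label of the row holding the maximum
```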

result

df.loc[most_recent]

     email signup_date  call_date
4  b@b.com  2016-04-04 2016-05-04
1  b@b.com  2016-06-06 2016-06-20
0  a@a.com  2016-07-21 2016-07-25


all together

df = df1.merge(df2)
cond1 = df.signup_date.le(df.call_date)
df = df[cond1]
most_recent = df.groupby(['call_date', 'email']).signup_date.idxmax()
df.loc[most_recent]

     email signup_date  call_date
4  b@b.com  2016-04-04 2016-05-04
1  b@b.com  2016-06-06 2016-06-20
0  a@a.com  2016-07-21 2016-07-25

Another solution uses sort_values and then aggregates with first:

df = df1.merge(df2)
df = df[df.signup_date <= df.call_date]
print (df.sort_values("signup_date", ascending=False)
         .groupby(['call_date', 'email'], as_index=False)
         .first())

   call_date    email signup_date
0 2016-05-04  b@b.com  2016-04-04
1 2016-06-20  b@b.com  2016-06-06
2 2016-07-25  a@a.com  2016-07-21
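A third option, not from either of the original answers: `pd.merge_asof` attributes each call to the most recent signup at or before it in a single pass. Both frames must be sorted on their date keys, and `by='email'` keeps the matching per user:

```python
from io import StringIO
import pandas as pd

txt1 = """email signup_date
a@a.com 7/21/16
b@b.com 6/6/16
d@d.com 5/5/16
b@b.com 4/4/16"""
df1 = pd.read_csv(StringIO(txt1), parse_dates=[1], sep=r"\s+")

txt2 = """email call_date
a@a.com 7/25/16
b@b.com 6/20/16
b@b.com 5/4/16"""
df2 = pd.read_csv(StringIO(txt2), parse_dates=[1], sep=r"\s+")

# for each call, find the closest earlier (or same-day) signup by the same email
out = pd.merge_asof(
    df2.sort_values('call_date'),
    df1.sort_values('signup_date'),
    left_on='call_date', right_on='signup_date',
    by='email', direction='backward')
print(out)
```

Signups with no call (d@d.com here) simply don't appear, matching the outputs above; merging this result back onto df1 would surface them as NaT if needed.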
