I have a dataframe containing session and bid data where there are three columns (of interest): user_id, event and date.
What I want to do is add a column to my dataframe holding the date of each user's first bid. I have tried several approaches, but the problem is that users very commonly generate a session before they make a bid, so the earliest date per user is usually a session rather than a bid.
I have tried several ways of getting a filter to work, but it does not behave the way I expect. The documentation says "Return a copy of a DataFrame excluding elements from groups that do not satisfy the boolean criterion specified by func.", which sounds like what I want: ignore the events in each group that are sessions rather than bids.
df['first bid date'] = df.groupby('user_id').filter(lambda x: x['event'] == 'bid')['date'].transform('min')
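For reference, here is a minimal sketch of what groupby(...).filter appears to do, using made-up data (user 3, who never bids, is a hypothetical addition). The function is expected to return one True/False per group, and each group is then kept or dropped as a whole, so session rows inside a kept group survive. The lambda above returns a boolean Series instead of a single boolean, which is not what filter expects.

```python
import pandas as pd

# Made-up sample data; user 3 is a hypothetical user with no bids.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3],
    "event": ["bid", "bid", "session", "session", "bid", "bid", "session"],
    "date": pd.to_datetime([
        "2010-01-01", "2013-01-01", "2009-01-01",
        "2010-01-01", "2015-01-01", "2017-01-01", "2011-01-01",
    ]),
})

# filter takes ONE boolean per group and keeps or drops the whole group.
has_bid = df.groupby("user_id").filter(lambda g: (g["event"] == "bid").any())

# User 3 is gone, but the session rows of users 1 and 2 are still present.
print(has_bid)
```

So filter selects groups, not rows within a group, which is why it cannot remove the session rows on its own.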
When this did not work I tried to instead have the transform take a custom function, like this:
def custom_transform(group):
    return group[group['event'] == 'bid']['date'].min()

df['first bid date'] = df.groupby('user_id')['date'].transform(custom_transform)
But this does not work because the transform only receives the date column, so it cannot access both the date and the event at the same time, seemingly no matter what I group by.
Finally, I tried grouping by both the user_id and the event, like this:
df['first bid date'] = df.groupby(['user_id', 'event'])['date'].transform('min')
which kind of works, but then I still have to replace every first-session date with the first-bid date, since each user now has both a first session and a first bid.
Any input on how to make this one-liner work? It seems like a combination of groupby, filter and transform should do the trick, but I just can't crack it.
Thanks!
The idea is to replace the non-matching values with missing values before the transform, here using Series.where:
df['first bid date'] = (df.assign(date=df['date'].where(df['event'] == 'bid'))
                          .groupby('user_id')['date']
                          .transform('min'))
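As a rough, self-contained sketch of the same idea (the sample data is made up, and user 3 is a hypothetical user who never bids): Series.where turns the dates of non-bid rows into NaT, and since min skips missing values, each user's minimum is their first bid date.

```python
import pandas as pd

# Made-up sample data; user 3 is a hypothetical user with no bids.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3],
    "event": ["bid", "bid", "session", "session", "bid", "bid", "session"],
    "date": pd.to_datetime([
        "2010-01-01", "2013-01-01", "2009-01-01",
        "2010-01-01", "2015-01-01", "2017-01-01", "2011-01-01",
    ]),
})

# Keep the date only where the event is a bid; other rows become NaT.
bid_dates = df["date"].where(df["event"] == "bid")

# min ignores NaT, so every row of a user gets that user's earliest bid date.
df["first bid date"] = bid_dates.groupby(df["user_id"]).transform("min")

print(df)
```

Users without any bid end up with NaT in the new column, and the original date column is left untouched.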
Here is some sample code with a dataframe to match the problem (note that the user column is named uid here rather than user_id).
import pandas as pd
from io import StringIO

csv = StringIO("""index,uid,event,date
0,1,"bid",2010-01-01
1,1,"bid",2013-01-01
2,1,"session",2009-01-01
3,2,"session",2010-01-01
4,2,"bid",2015-01-01
5,2,"bid",2017-01-01""")
# parse_dates makes 'date' a datetime column, so min() compares dates, not strings
df = pd.read_csv(csv, index_col='index', parse_dates=['date']).reset_index(drop=True)
This alternative approach uses the merge function.
df.merge(df[df['event'] == 'bid'].groupby('uid')['date'].min(),
         on='uid', suffixes=('', '_first_bid'))
Which prints:
   uid    event        date date_first_bid
0    1      bid  2010-01-01     2010-01-01
1    1      bid  2013-01-01     2010-01-01
2    1  session  2009-01-01     2010-01-01
3    2  session  2010-01-01     2015-01-01
4    2      bid  2015-01-01     2015-01-01
5    2      bid  2017-01-01     2015-01-01
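One thing to watch for: merge defaults to an inner join, so a user with no bids at all would disappear from the result. A possible variant, sketched here with made-up data (uid 3 is a hypothetical user who never bids), computes the first bid per user once and looks it up with Series.map, which keeps every row. Passing how='left' to merge should have a similar effect.

```python
import pandas as pd

# Made-up sample data; uid 3 is a hypothetical user with no bids.
df = pd.DataFrame({
    "uid": [1, 1, 1, 2, 2, 2, 3],
    "event": ["bid", "bid", "session", "session", "bid", "bid", "session"],
    "date": pd.to_datetime([
        "2010-01-01", "2013-01-01", "2009-01-01",
        "2010-01-01", "2015-01-01", "2017-01-01", "2011-01-01",
    ]),
})

# Earliest bid date per user, indexed by uid.
first_bid = df.loc[df["event"] == "bid"].groupby("uid")["date"].min()

# Look up each row's uid in that Series; users without bids get NaT.
df["date_first_bid"] = df["uid"].map(first_bid)

print(df)
```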