I am familiar with how to merge/join two Pandas dataframes like so:
result = pd.merge(user_usage,
user_device[['use_id', 'platform', 'device']],
on='use_id',
how='right')
However, I don't know how to do a self-join of a table:
id rank ts
1 1 2015-11-01
1 2 2015-11-03
1 3 2015-11-07
where I want to compare each id-rank's timestamp with that of the following one.
In SQL and Scala syntax, this is easy. In SQL, I would just do something like (in pseudo-code):
SELECT *
FROM df a
LEFT JOIN df b
ON a.id = b.id AND (a.rank + 1) = b.rank;
In the pd.merge syntax, I've never seen such an example and am still unable to find one.
To be clear, I'm looking for:
id rank ts ts_2 time_since_previous_obs
1 1 2015-11-01 <null> 0
1 2 2015-11-03 2015-11-01 2
1 3 2015-11-07 2015-11-03 4
Is this possible with Python Pandas merge or join syntax? Is there another smarter way?
You can shift the rank on one side before the merge, so that each row joins to the row with the next rank:
(df.merge(df.assign(rank=df['rank'] - 1),
on=['id','rank'], how='left')
.assign(last_obs_since=lambda x: x['ts_y'] - x['ts_x'])
)
Output:
id rank ts_x ts_y last_obs_since
0 1 1 2015-11-01 2015-11-02 1 days
1 1 2 2015-11-02 2015-11-03 1 days
2 1 3 2015-11-03 NaT NaT
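For reference, here is a minimal self-contained sketch of the same idea run against the sample data from the question. It is flipped to pull in the previous timestamp, as in the desired output. The names `prev` and `result` are only illustrative, and filling the first gap with 0 is an assumption based on the expected table above.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 1, 1],
    "rank": [1, 2, 3],
    "ts": pd.to_datetime(["2015-11-01", "2015-11-03", "2015-11-07"]),
})

# Right-hand side of the self-join: add 1 to rank so that the row with
# rank r here lines up with the row with rank r + 1 on the left.
prev = df.assign(rank=df["rank"] + 1).rename(columns={"ts": "ts_2"})

# Left join keeps every original row; the first rank has no predecessor,
# so its ts_2 is NaT.
result = df.merge(prev, on=["id", "rank"], how="left")

# Days since the previous observation; the missing first gap is set to 0
# (an assumption taken from the desired output in the question).
result["time_since_previous_obs"] = (
    (result["ts"] - result["ts_2"]).dt.days.fillna(0).astype(int)
)

print(result)
```

Because the join is on the rank value itself, this pairs rows by rank rather than by position. A gap in the rank sequence therefore yields NaT instead of silently pairing non-adjacent rows.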
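The following should also work, provided the rows are already sorted by rank and the frame holds a single id (a plain shift does not respect id boundaries):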
df['ts2'] = df.shift(1)['ts']
df['last_obs_since'] = df['ts'] - df['ts2']
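If the frame holds more than one id, a grouped shift is probably what you want. The sketch below assumes rows should be paired within each id in rank order. The rows for id 2 are made up purely for illustration, and the column names follow the desired output in the question.

```python
import pandas as pd

# Rows for id 1 come from the question; rows for id 2 are hypothetical.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "rank": [1, 2, 3, 1, 2],
    "ts": pd.to_datetime([
        "2015-11-01", "2015-11-03", "2015-11-07",
        "2015-12-01", "2015-12-10",
    ]),
})

# shift() works by position, so make sure rows are ordered within each id.
df = df.sort_values(["id", "rank"]).reset_index(drop=True)

# Previous timestamp within the same id; the first row of each id gets NaT.
df["ts_2"] = df.groupby("id")["ts"].shift(1)

# Days since the previous observation, with 0 for the first row of each id.
df["time_since_previous_obs"] = (
    (df["ts"] - df["ts_2"]).dt.days.fillna(0).astype(int)
)

print(df)
```

Unlike the merge, this pairs each row with whatever row precedes it in its group, so it only matches the SQL-style self-join when ranks are consecutive.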