Pandas: Conditional Self-Join Based on Multiple Conditions
I am familiar with how to merge/join two Pandas dataframes like so:
result = pd.merge(user_usage,
                  user_device[['use_id', 'platform', 'device']],
                  on='use_id',
                  how='right')
However, I don't know how I would do a self-join of a table:
id  rank  ts
1   1     2015-11-01
1   2     2015-11-03
1   3     2015-11-07
where I want to compare each id-rank's timestamp with the following one.
In SQL and Scala syntax, this is easy. In SQL, I would just do something like (in pseudo-code):
SELECT *
FROM df a
LEFT JOIN df b
  ON a.id = b.id AND (a.rank + 1) = b.rank;
In the pd.merge syntax, I've never seen such an example and am still unable to find one.
To be clear, I'm looking for:
id  rank  ts          ts_2        time_since_previous_obs
1   1     2015-11-01  <null>      0
1   2     2015-11-03  2015-11-01  2
1   3     2015-11-07  2015-11-03  4
Is this possible with Python Pandas merge or join syntax? Is there another, smarter way?
Well, you can modify the rank before the merge:
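# Shift rank down by one on the right-hand copy, so each row (id, rank)
# pairs with the row that originally had rank + 1 (the following observation).
(df.merge(df.assign(rank=df['rank'] - 1),
          on=['id', 'rank'], how='left')
   .assign(last_obs_since=lambda x: x['ts_y'] - x['ts_x'])
)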
Output:
   id  rank       ts_x       ts_y last_obs_since
0   1     1 2015-11-01 2015-11-02         1 days
1   1     2 2015-11-02 2015-11-03         1 days
2   1     3 2015-11-03        NaT            NaT
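Note that the merge above pairs each row with the following observation, which matches the SQL in the question but not the desired table, where ts_2 holds the previous timestamp. If the previous one is what you want, a minimal sketch might look like the following. It assumes ts is already a datetime column and that the first row of each id should show 0; the names previous and result are only for illustration.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({
    "id": [1, 1, 1],
    "rank": [1, 2, 3],
    "ts": pd.to_datetime(["2015-11-01", "2015-11-03", "2015-11-07"]),
})

# Shift rank UP by one on the right-hand copy, so the left row (id, rank)
# matches the row that originally had rank - 1 (the previous observation).
previous = df.assign(rank=df["rank"] + 1).rename(columns={"ts": "ts_2"})

result = df.merge(previous, on=["id", "rank"], how="left")

# Whole days between the two timestamps; rows with no previous observation
# get 0 here (an assumption based on the desired table in the question).
result["time_since_previous_obs"] = (
    (result["ts"] - result["ts_2"]).dt.days.fillna(0).astype(int)
)

print(result)
```

Renaming ts to ts_2 before the merge avoids the automatic _x/_y suffixes, since the two frames then share only the key columns.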
The following should also work:
df['ts2'] = df.shift(1)['ts']
df['last_obs_since'] = df['ts'] - df['ts2']
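One caveat: a plain shift works on the frame as a whole, so with more than one id the first row of an id would pick up the last timestamp of the id before it. A hedged variant that shifts within each id is sketched below. The rows for id 2 are made up for illustration, and it assumes rank gives the ordering within each id.

```python
import pandas as pd

# Rows for id 1 come from the question; the rows for id 2 are hypothetical.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "rank": [1, 2, 3, 1, 2],
    "ts": pd.to_datetime([
        "2015-11-01", "2015-11-03", "2015-11-07",
        "2015-12-01", "2015-12-05",
    ]),
})

# Make sure rows are in rank order within each id before shifting.
df = df.sort_values(["id", "rank"])

# Shift within each id, so the first row of every id gets NaT
# instead of a timestamp from a different id.
df["ts2"] = df.groupby("id")["ts"].shift(1)
df["last_obs_since"] = df["ts"] - df["ts2"]

print(df)
```

This avoids the merge altogether, which is usually simpler when all you need is the neighbouring row in the same group.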