Pandas: Conditional Self-Join Based on Multiple Conditions

I am familiar with how to merge/join two Pandas dataframes like so:

result = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id', 
                 how='right')

However, I don't know how I would do a self-join of a table:

id    rank   ts
1     1      2015-11-01
1     2      2015-11-03
1     3      2015-11-07

where I want to compare each id-rank's timestamp with the following one.

In SQL and Scala syntax, this is easy. In SQL, I would just do something like this (in pseudo-code):

SELECT *
FROM df a
LEFT JOIN df b
ON a.id = b.id AND (a.rank + 1) = b.rank;

In the pd.merge syntax, I've never seen such an example, and I am still unable to find one.

To be clear, I'm looking for:

id    rank   ts           ts_2         time_since_previous_obs
1     1      2015-11-01   <null>       0
1     2      2015-11-03   2015-11-01   2
1     3      2015-11-07   2015-11-03   4

Is this possible with Python Pandas merge or join syntax? Is there another, smarter way?

Well, you can modify the rank before the merge, so that each row is matched with the row whose rank is one higher:

(df.merge(df.assign(rank=df['rank'] - 1),
          on=['id','rank'], how='left')
   .assign(last_obs_since=lambda x: x['ts_y'] - x['ts_x'])
)

Output:

   id  rank       ts_x       ts_y last_obs_since
0   1     1 2015-11-01 2015-11-02         1 days
1   1     2 2015-11-02 2015-11-03         1 days
2   1     3 2015-11-03        NaT            NaT
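
For reference, here is a minimal self-contained sketch of the same merge idea that targets the layout asked for in the question (previous timestamp next to each row). It assumes `ts` is a datetime column and uses the question's three sample rows. The names `previous` and `result` are only illustrative.

```python
import pandas as pd

# Sample data from the question; ts is parsed as datetime so subtraction works.
df = pd.DataFrame({
    "id": [1, 1, 1],
    "rank": [1, 2, 3],
    "ts": pd.to_datetime(["2015-11-01", "2015-11-03", "2015-11-07"]),
})

# Right-hand side of the self-join: bump rank by one so that the row with
# rank r lines up with the left-hand row of rank r + 1 (its successor).
previous = df.assign(rank=df["rank"] + 1).rename(columns={"ts": "ts_2"})

# Equivalent in spirit to: LEFT JOIN ... ON a.id = b.id AND a.rank = b.rank + 1
result = df.merge(previous, on=["id", "rank"], how="left")

# Whole days since the previous observation; the first row of each id has
# no predecessor, so it stays NaN here rather than 0.
result["time_since_previous_obs"] = (result["ts"] - result["ts_2"]).dt.days
print(result)
```

If a 0 is preferred for the first observation, as in the question's expected table, `fillna(0)` on the last column should do it.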

Another option is to build the shifted column by hand:

# create a list from ts and shift it by one to make ts2
ts2 = df["ts"][:-1].tolist()
ts2.insert(0, None)

# append the list to the dataframe
df["ts2"] = ts2

# calculate the difference
df["diff"] = df["ts"] - df["ts2"]
print(df)

The following should also work:

df['ts2'] = df.shift(1)['ts']
df['last_obs_since'] = df['ts'] - df['ts2']
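
Note that a plain `shift` works on the whole frame, so with more than one `id` the first row of an id would pick up the last timestamp of the previous id. A possible sketch that keeps the shift within each id is shown below. It assumes `ts` is a datetime column, and the rows for `id` 2 are made up purely for illustration.

```python
import pandas as pd

# Rows for id 1 come from the question; rows for id 2 are hypothetical.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "rank": [1, 2, 3, 1, 2],
    "ts": pd.to_datetime([
        "2015-11-01", "2015-11-03", "2015-11-07",
        "2015-12-01", "2015-12-05",
    ]),
})

# Make sure rows are ordered by rank within each id before shifting.
df = df.sort_values(["id", "rank"]).reset_index(drop=True)

# Shift ts by one row inside each id group, so the first row of every id
# gets NaT instead of a timestamp from a different id.
df["ts_2"] = df.groupby("id")["ts"].shift(1)
df["time_since_previous_obs"] = df["ts"] - df["ts_2"]
print(df)
```

This avoids the merge entirely, but it relies on ranks being consecutive within each id; if ranks can have gaps, the merge on `rank + 1` behaves differently, since it only matches exact neighbours.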
