简体   繁体   English

如何根据不同条件加入或合并两个数据框?

[英]How to join or merge two dataframes based on different condition?

I want to merge or join two DataFrames based on different date.我想根据不同的日期合并或加入两个 DataFrame。 Join Completed date with any earlier Start date.将完成日期与任何更早的开始日期合并。 I have the following dataframes:我有以下数据框:

df1:
Complted_date
2015
2017
2020


df2:
Start_date
2001
2010
2012
2015
2016
2017
2018
2019
2020
2021


And desired output is:


Complted_date   Start_date
2015            2001
2015            2010
2015            2012
2015            2015
2017            2001
2017            2010
2017            2012
2017            2015
2017            2016
2017            2017
2020            2001
2020            2010
2020            2012
2020            2015
2020            2016
2020            2017
2020            2018
2020            2019
2020            2020

I've tried but I'm not getting the output I want.我试过了,但没有得到我想要的 output。

Thank you for your help!!谢谢您的帮助!!

Check out merge , which gives you the expected output:查看merge ,它会为您提供预期的 output:

(df1.assign(key=1)
   .merge(df2.assign(key=1), on='key')
   .query('Complted_date>=Start_date')
   .drop('key', axis=1)
)

Output: Output:

    Complted_date  Start_date
0            2015        2001
1            2015        2010
2            2015        2012
3            2015        2015
10           2017        2001
11           2017        2010
12           2017        2012
13           2017        2015
14           2017        2016
15           2017        2017
20           2020        2001
21           2020        2010
22           2020        2012
23           2020        2015
24           2020        2016
25           2020        2017
26           2020        2018
27           2020        2019
28           2020        2020

However, you might want to check out merge_asof :但是,您可能想查看merge_asof

pd.merge_asof(df2, df1, 
              right_on='Complted_date', 
              left_on='Start_date', 
              direction='forward')

Output: Output:

   Start_date  Complted_date
0        2001         2015.0
1        2010         2015.0
2        2012         2015.0
3        2015         2015.0
4        2016         2017.0
5        2017         2017.0
6        2018         2020.0
7        2019         2020.0
8        2020         2020.0
9        2021            NaN

You can do cross-join and pick records which have Completed_date > Start_date :您可以进行cross-join并选择具有Completed_date > Start_date的记录:

Use df.merge with df.query :使用df.mergedf.query

In [101]: df1['tmp'] = 1

In [102]: df2['tmp'] = 1

In [107]: res = df1.merge(df2, how='outer').query("Complted_date >= Start_date").drop('tmp', 1)

In [108]: res
Out[108]: 
    Complted_date  Start_date
0            2015        2001
1            2015        2010
2            2015        2012
3            2015        2015
10           2017        2001
11           2017        2010
12           2017        2012
13           2017        2015
14           2017        2016
15           2017        2017
20           2020        2001
21           2020        2010
22           2020        2012
23           2020        2015
24           2020        2016
25           2020        2017
26           2020        2018
27           2020        2019
28           2020        2020

Here is another way using pd.Series() and explode()这是使用pd.Series()explode()的另一种方式

df1['Start_date'] = pd.Series([df2['Start_date'].tolist()])
df1['Start_date'] = df1['Start_date'].fillna(method='ffill')
df1.explode('Start_date').loc[lambda x: x['Complted_date'].ge(x['Start_date'])].reset_index(drop=True)

You could use conditional_join from pyjanitor to get rows where compltd_date is >= start_date :您可以使用pyjanitor中的conditional_join来获取compltd_date >= start_date的行:

# pip install pyjanitor
import pandas as pd
import janitor

df1.conditional_join(df2, ('Complted_date', 'Start_date', '>='))
Out[1163]: 
            left      right
   Complted_date Start_date
0           2015       2001
1           2015       2010
2           2015       2012
3           2015       2015
4           2017       2001
5           2017       2010
6           2017       2012
7           2017       2015
8           2017       2016
9           2017       2017
10          2020       2001
11          2020       2010
12          2020       2012
13          2020       2015
14          2020       2016
15          2020       2017
16          2020       2018
17          2020       2019
18          2020       2020

Under the hood, it is just binary search (searchsorted) - the aim is to avoid a cartesian join, and hopefully, reduce memory usage.在引擎盖下,它只是二进制搜索 (searchsorted) - 目的是避免笛卡尔连接,并希望减少 memory 的使用。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM