How can I do conditional joins in python like subquery?
I have 2 dfs.
I want to get df2's category, but there are no common keys.
df1's code should be between range1 and range2.
df2's s_date has only 2 values (20210101, 20220101).
df1

code | date
---|---
A51 | 20221215
C52 | 20221215
A51 | 20211215
D89 | 20211215
df2

range1 | range2 | s_date | category
---|---|---|---
A00 | B99 | 20220101 | A
C50 | C60 | 20220101 | B
A00 | B60 | 20210101 | AA
D00 | D90 | 20210101 | DD
For the output, I did this work using an SQL statement like the one below, but I want to do it in Python (pandas).
output

code | date | category
---|---|---
A51 | 20221215 | A
C52 | 20221215 | B
A51 | 20211215 | AA
D89 | 20211215 | DD
SELECT
  code, date,
  CASE WHEN date >= '20220101'
       THEN (SELECT category
             FROM df2
             WHERE code BETWEEN range1 AND range2
               AND s_date = '20220101')
       ELSE (SELECT category
             FROM df2
             WHERE code BETWEEN range1 AND range2
               AND s_date = '20210101') END AS category
FROM df1
Thanks.
You can directly compare strings in Python. For example, 'A50' > 'A00' will result in True, while 'A50' > 'B00' will return False. We can use this to filter df2 by these values.
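A minimal illustration of the lexicographic comparisons this relies on (editor's sketch; the chained form mirrors SQL's inclusive BETWEEN):

```python
# Python compares strings lexicographically, character by character
assert 'A50' > 'A00'          # '5' > '0' at index 1
assert not ('A50' > 'B00')    # 'A' < 'B' at index 0

# A chained comparison gives a BETWEEN-style range check on string codes
code = 'A51'
assert 'A00' <= code <= 'B99'
print('all comparisons hold')
```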
Here is how it might look in code:
categories = []
for _, row in df1.iterrows():
    code = row['code']
    date = row['date']
    # SQL BETWEEN is inclusive, so use <= on both ends
    if date >= '20220101':
        match = df2[(df2['range1'] <= code) & (code <= df2['range2']) & (df2['s_date'] == '20220101')]
    else:
        match = df2[(df2['range1'] <= code) & (code <= df2['range2']) & (df2['s_date'] == '20210101')]
    categories.append(match['category'].iloc[0])
df1['category'] = categories
There is perhaps a much more elegant approach, but the main pattern is like that.
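For instance, a vectorized sketch of the same pattern (editor's example, rebuilding the question's frames inline; `how='cross'` requires pandas >= 1.2): cross-join the two frames, then keep only rows whose code falls inside [range1, range2] and whose s_date matches the date bucket.

```python
import numpy as np
import pandas as pd

# Example frames from the question
df1 = pd.DataFrame({'code': ['A51', 'C52', 'A51', 'D89'],
                    'date': ['20221215', '20221215', '20211215', '20211215']})
df2 = pd.DataFrame({'range1': ['A00', 'C50', 'A00', 'D00'],
                    'range2': ['B99', 'C60', 'B60', 'D90'],
                    's_date': ['20220101', '20220101', '20210101', '20210101'],
                    'category': ['A', 'B', 'AA', 'DD']})

# Pair every df1 row with every df2 row, then filter
merged = df1.merge(df2, how='cross')

# Which s_date bucket each date belongs to
bucket = np.where(merged['date'] >= '20220101', '20220101', '20210101')

in_range = (merged['range1'] <= merged['code']) & (merged['code'] <= merged['range2'])
out = merged[in_range & (merged['s_date'] == bucket)]
out = out[['code', 'date', 'category']].reset_index(drop=True)
print(out)
```

This avoids the Python-level loop, at the cost of materializing the full cross product, so it suits small lookup tables like df2 here.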
`date` and `code` impose an ordering on the rows. One approach is to define an integer encoding of these columns, then use `merge_asof` to combine the dataframes.
We'll define an encoding function which assumes `code`s are guaranteed to be three ASCII characters (see comment).
from struct import pack, unpack

def encode(row, code_label='code', date_label='date'):
    # Flag byte: 1 if the date falls in the 20220101 bucket, else 0
    date_is_ge_20220101 = row[date_label] >= '20220101'
    # Next three bytes: the code's characters (assumed exactly 3 ASCII chars)
    code_as_bytes = bytes(row[code_label], encoding='ascii')
    # Pack flag + 3 chars into 4 bytes, reinterpret as a big-endian uint32
    return unpack('>I', pack('?3s', date_is_ge_20220101, code_as_bytes))[0]
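A quick sanity check of what the packing produces (editor's sketch; the function is repeated so the snippet runs on its own, and the sample rows are plain dicts standing in for dataframe rows):

```python
from struct import pack, unpack

def encode(row, code_label='code', date_label='date'):
    # Flag byte (date bucket) followed by the 3 ASCII code characters,
    # read back as one big-endian 32-bit integer
    date_is_ge_20220101 = row[date_label] >= '20220101'
    code_as_bytes = bytes(row[code_label], encoding='ascii')
    return unpack('>I', pack('?3s', date_is_ge_20220101, code_as_bytes))[0]

# Same code, different date buckets: the flag byte dominates the ordering
early = encode({'code': 'A51', 'date': '20211215'})  # flag 0
late = encode({'code': 'A51', 'date': '20221215'})   # flag 1
assert early == int.from_bytes(b'\x00A51', 'big')
assert late - early == 1 << 24  # the bucket flag is the most significant byte
```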
Perform the encoding...
df1['encoding'] = df1.apply(encode, axis='columns')
df2['encoding'] = df2.apply(encode, axis='columns',
                            code_label='range1',
                            date_label='s_date')
Do the merge, keeping in mind that `merge_asof` requires the merge values to be sorted.
pd.merge_asof(df1.sort_values('encoding'),
              df2.sort_values('encoding'),
              on='encoding')
giving...
code date encoding range1 range2 s_date category
0 A51 20211215 4273457 A00 B60 20210101 AA
1 D89 20211215 4470841 D00 D90 20210101 DD
2 A51 20221215 21050673 A00 B99 20220101 A
3 C52 20221215 21181746 C50 C60 20220101 B
For the record, I would only take this approach if absolutely necessary to meet a project's requirements. It is too far removed from the original problem statement, albeit interesting.