
How can I do conditional joins in Python, like a SQL subquery?

I have 2 dfs.

I want to get df2's category, but there are no common keys.

df1's code should be between range1 and range2.

df2's s_date has only 2 values (20210101, 20220101).

df1

code  date
A51   20221215
C52   20221215
A51   20211215
D89   20211215

df2

range1  range2  s_date    category
A00     B99     20220101  A
C50     C60     20220101  B
A00     B60     20210101  AA
D00     D90     20210101  DD

To get the output, I did this in SQL with the statement below,

but I want to do the same thing in Python (pandas).

output

code  date      category
A51   20221215  A
C52   20221215  B
A51   20211215  AA
D89   20211215  DD

SELECT 
 code, date,
 CASE WHEN date >= '20220101'
 THEN (SELECT category
       FROM df2
       WHERE code BETWEEN range1 AND range2
       AND   s_date = '20220101')
 ELSE (SELECT category
       FROM df2
       WHERE code BETWEEN range1 AND range2
       AND   s_date = '20210101') END AS category
FROM df1

Thanks.

You can directly compare strings in Python. For example, 'A50' > 'A00' evaluates to True, while 'A50' > 'B00' evaluates to False. We can use this to filter df2 by these values.
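As a quick check of that claim, the sketch below compares a few codes. It assumes every code has the same fixed format (one letter followed by two digits), which is what makes lexicographic order agree with the intended range order. The helper name `in_range` is hypothetical and only used for illustration.

```python
def in_range(code: str, low: str, high: str) -> bool:
    """Inclusive range test, mirroring SQL's BETWEEN low AND high."""
    return low <= code <= high

print('A50' > 'A00')                  # True
print('A50' > 'B00')                  # False
print(in_range('A51', 'A00', 'B99'))  # True
print(in_range('C52', 'A00', 'B99'))  # False
```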

Here is how it might look in code:

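categories = []
# iterrows() yields (index, row) pairs, so columns can be accessed by name.
for _, row in df1.iterrows():
    code = row['code']
    date = row['date']
    # Pick the df2 snapshot for this row (assumes dates are stored as strings).
    s_date = '20220101' if date >= '20220101' else '20210101'
    # Inclusive bounds, to match SQL's BETWEEN.
    match = df2[(df2['range1'] <= code) & (code <= df2['range2']) & (df2['s_date'] == s_date)]['category']
    # Keep the first matching category, or None when no range contains the code.
    categories.append(match.iloc[0] if not match.empty else None)
df1['category'] = categories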

There may be a more elegant approach, but this is the main pattern.
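One possible loop-free variant is sketched below. It is not from the answer above: it swaps the row-by-row loop for an equi-join on a derived `s_date` column followed by a range filter. The sample frames are rebuilt from the question, dates are assumed to be strings, and the names `left`, `pairs`, and `result` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'code': ['A51', 'C52', 'A51', 'D89'],
    'date': ['20221215', '20221215', '20211215', '20211215'],
})
df2 = pd.DataFrame({
    'range1': ['A00', 'C50', 'A00', 'D00'],
    'range2': ['B99', 'C60', 'B60', 'D90'],
    's_date': ['20220101', '20220101', '20210101', '20210101'],
    'category': ['A', 'B', 'AA', 'DD'],
})

# Decide which df2 snapshot applies to each df1 row (the SQL CASE WHEN).
left = df1.assign(
    s_date=df1['date'].map(lambda d: '20220101' if d >= '20220101' else '20210101')
)

# Join every df1 row to all df2 rows of its snapshot,
# then keep only the pairs whose code lies inside [range1, range2].
pairs = left.merge(df2, on='s_date', how='left')
inside = pairs['code'].between(pairs['range1'], pairs['range2'])
result = pairs.loc[inside, ['code', 'date', 'category']].reset_index(drop=True)
print(result)
```

Note that, unlike the SQL statement, this sketch drops df1 rows whose code falls in no range instead of returning them with an empty category, and the intermediate `pairs` frame grows with the number of df2 rows per snapshot.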

date and code impose an ordering on the rows. One approach is to define an integer encoding of these columns, then use merge_asof to combine the dataframes.

We'll define an encoding function which assumes codes are guaranteed to be three ASCII characters (see comment).

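from struct import pack, unpack

import pandas as pd

def encode(row, code_label='code', date_label='date'):
    # Flag for the period the row belongs to; it becomes the high byte.
    date_is_ge_20220101 = row[date_label] >= '20220101'
    code_as_bytes = bytes(row[code_label], encoding='ascii')
    # Pack the flag and the 3 code bytes into 4 bytes, then read them as one big-endian integer.
    return unpack('>I', pack('?3s', date_is_ge_20220101, code_as_bytes))[0]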
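To make the encoding concrete, the sketch below reproduces the same integers with `int.from_bytes`, under the same assumption of three-character ASCII codes and string dates. The function name `encode_value` is hypothetical and is not part of the answer's code.

```python
def encode_value(code: str, date: str) -> int:
    """Return a 32-bit integer: period flag in the high byte, code in the low three bytes."""
    flag = 1 if date >= '20220101' else 0
    raw = bytes([flag]) + code.encode('ascii')
    return int.from_bytes(raw, byteorder='big')

print(encode_value('A51', '20211215'))  # 4273457
print(encode_value('A51', '20221215'))  # 21050673
```

Because the flag sits in the most significant byte, every row of the earlier period sorts before every row of the later one, and within a period rows sort by code.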

Perform the encoding...

df1['encoding'] = df1.apply(encode, axis='columns')
df2['encoding'] = df2.apply(encode, axis='columns',
                                    code_label='range1',
                                    date_label='s_date')

Do the merge, keeping in mind that merge_asof requires the merge values to be sorted.

pd.merge_asof(df1.sort_values('encoding'),
              df2.sort_values('encoding'),
              on='encoding')

giving...

    code date     encoding  range1  range2  s_date   category
0   A51  20211215 4273457   A00     B60     20210101 AA
1   D89  20211215 4470841   D00     D90     20210101 DD
2   A51  20221215 21050673  A00     B99     20220101 A
3   C52  20221215 21181746  C50     C60     20220101 B
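One caveat with this approach: merge_asof only finds the nearest encoded range1 at or below each code, so range2 and the period of the matched row are never checked. The sketch below is one possible guard, not part of the answer: it rebuilds the sample data, uses a hypothetical `encode_value` helper based on `int.from_bytes`, and blanks out the category when the match is invalid. The row `B70` is a made-up input added to show the effect, and the `period` column is an illustrative name.

```python
import pandas as pd

def encode_value(code, date):
    # Assumption: codes are exactly three ASCII characters and dates are strings.
    flag = 1 if date >= '20220101' else 0
    return int.from_bytes(bytes([flag]) + code.encode('ascii'), 'big')

df1 = pd.DataFrame({
    'code': ['A51', 'C52', 'A51', 'D89', 'B70'],  # 'B70' is a hypothetical extra row
    'date': ['20221215', '20221215', '20211215', '20211215', '20211215'],
})
df2 = pd.DataFrame({
    'range1': ['A00', 'C50', 'A00', 'D00'],
    'range2': ['B99', 'C60', 'B60', 'D90'],
    's_date': ['20220101', '20220101', '20210101', '20210101'],
    'category': ['A', 'B', 'AA', 'DD'],
})

# The df2 snapshot each df1 row is expected to match.
df1['period'] = ['20220101' if d >= '20220101' else '20210101' for d in df1['date']]
df1['encoding'] = [encode_value(c, d) for c, d in zip(df1['code'], df1['date'])]
df2['encoding'] = [encode_value(c, d) for c, d in zip(df2['range1'], df2['s_date'])]

merged = pd.merge_asof(df1.sort_values('encoding'),
                       df2.sort_values('encoding'),
                       on='encoding')

# Keep a category only if the code is inside the matched range and the period agrees.
valid = (merged['code'].between(merged['range1'], merged['range2'])
         & (merged['s_date'] == merged['period']))
merged['category'] = merged['category'].where(valid)
print(merged[['code', 'date', 'category']])
```

Here `B70` is matched to the A00 to B60 row by merge_asof, and the guard then clears its category because B70 lies above B60.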

For the record, I would only take this approach if absolutely necessary to meet a project's requirements. It is too far removed from the original problem statement, albeit interesting.


 