How can I do a conditional join in Python (pandas), like a SQL subquery?

I have two DataFrames.

I want to look up df2's category for each row of df1, but the two frames share no common key.

A row matches when df1's code falls between range1 and range2.

df2's s_date has only two values (20210101 and 20220101).

df1

code date
A51 20221215
C52 20221215
A51 20211215
D89 20211215

df2

range1 range2 s_date category
A00 B99 20220101 A
C50 C60 20220101 B
A00 B60 20210101 AA
D00 D90 20210101 DD

I produced the output below with the SQL statement that follows it,

but I would like to do the same thing in Python (pandas).

output

code date category
A51 20221215 A
C52 20221215 B
A51 20211215 AA
D89 20211215 DD

SELECT 
 code, date,
 CASE WHEN date >= '20220101'
 THEN (SELECT category
       FROM df2
       WHERE code BETWEEN range1 AND range2
       AND   s_date = '20220101')
 ELSE (SELECT category
       FROM df2
       WHERE code BETWEEN range1 AND range2
       AND   s_date = '20210101') END AS category
FROM df1

Thanks.

You can compare strings directly in Python. For example, 'A50' > 'A00' evaluates to True, while 'A50' > 'B00' evaluates to False. We can use this to filter df2 by these values.
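A minimal sketch of that comparison, assuming every code has the same length and layout (one letter followed by two digits, as in the question). The helper name in_range is made up for illustration.

```python
def in_range(code, low, high):
    """Return True when low <= code <= high under string (lexicographic) ordering."""
    return low <= code <= high


print(in_range("A51", "A00", "B99"))  # True
print(in_range("C52", "A00", "B99"))  # False
```

Strings are compared character by character, so codes of mixed length can give surprising results: 'A9' > 'A10' is True.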

Here is how it might look in code:

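# assumes code, date, range1, range2 and s_date are all stored as strings
categories = []
# iterrows() yields (index, row) pairs; each row supports label access such as row['code']
for _, row in df1.iterrows():
    code = row['code']
    # pick the s_date to match against, as the CASE WHEN in the SQL does
    s_date = '20220101' if row['date'] >= '20220101' else '20210101'
    # BETWEEN is inclusive, so use <= on both ends
    match = df2.loc[(df2['range1'] <= code) & (code <= df2['range2']) & (df2['s_date'] == s_date), 'category']
    # keep the first matching category, or None when no range contains the code
    categories.append(match.iloc[0] if not match.empty else None)
df1['category'] = categories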

There is probably a more elegant approach, but this is the basic pattern.
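One possible vectorized variant, sketched here under the assumption that all five columns hold strings: derive the applicable s_date for each df1 row, merge on it, and keep only the rows whose code lies in the range. The frames are rebuilt from the question's sample data so the snippet runs on its own; the names left, merged and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "code": ["A51", "C52", "A51", "D89"],
    "date": ["20221215", "20221215", "20211215", "20211215"],
})
df2 = pd.DataFrame({
    "range1": ["A00", "C50", "A00", "D00"],
    "range2": ["B99", "C60", "B60", "D90"],
    "s_date": ["20220101", "20220101", "20210101", "20210101"],
    "category": ["A", "B", "AA", "DD"],
})

# Choose the s_date that applies to each df1 row, mirroring the CASE WHEN in the SQL.
left = df1.assign(
    s_date=df1["date"].ge("20220101").map({True: "20220101", False: "20210101"})
)

# Pair every df1 row with the df2 rows for its s_date,
# then keep only pairs where range1 <= code <= range2 (inclusive, like BETWEEN).
merged = left.merge(df2, on="s_date")
in_range = merged["code"].between(merged["range1"], merged["range2"])
result = merged.loc[in_range, ["code", "date", "category"]].reset_index(drop=True)

print(result)
```

Unlike the SQL, this drops df1 rows that fall in no range instead of returning a null category, and it returns one row per matching range if ranges overlap.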

date and code impose an ordering on the rows. One approach is to define an integer encoding of these columns, then use merge_asof to combine the dataframes.

We'll define an encoding function which assumes codes are guaranteed to be three ASCII characters (see comment).

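from struct import pack, unpack

import pandas as pd  # needed for merge_asof further down

def encode(row, code_label='code', date_label='date'):
    # one flag byte (0 before 20220101, 1 on or after) followed by the three code bytes
    date_is_ge_20220101 = row[date_label] >= '20220101'
    code_as_bytes = bytes(row[code_label], encoding='ascii')
    # read the four bytes as a big-endian unsigned int, so integer order follows (flag, code) order
    return unpack('>I', pack('?3s', date_is_ge_20220101, code_as_bytes))[0]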
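To make the encoding concrete, here is a small sketch that applies the same packing to plain values instead of DataFrame rows. The helper encode_pair is hypothetical and exists only for this illustration.

```python
from struct import pack, unpack


def encode_pair(is_ge_20220101, code):
    # Hypothetical helper: same packing as encode(), applied to plain values.
    return unpack('>I', pack('?3s', is_ge_20220101, code.encode('ascii')))[0]


print(pack('?3s', False, b'A51'))  # b'\x00A51': flag byte, then the code bytes
print(encode_pair(False, 'A51'))   # 4273457
print(encode_pair(True, 'A51'))    # 21050673
```

Because the flag occupies the most significant byte, every row dated before 20220101 sorts below every row dated on or after it, whatever the code.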

Perform the encoding...

df1['encoding'] = df1.apply(encode, axis='columns')
df2['encoding'] = df2.apply(encode, axis='columns',
                                    code_label='range1',
                                    date_label='s_date')

Do the merge keeping in mind that merge_asof requires the merge values to be sorted.

pd.merge_asof(df1.sort_values('encoding'),
              df2.sort_values('encoding'),
              on='encoding')

giving...

    code date     encoding  range1  range2  s_date   category
0   A51  20211215 4273457   A00     B60     20210101 AA
1   D89  20211215 4470841   D00     D90     20210101 DD
2   A51  20221215 21050673  A00     B99     20220101 A
3   C52  20221215 21181746  C50     C60     20220101 B

For the record, I would only take this approach if absolutely necessary to meet a project's requirements. It is too far removed from the original problem statement, albeit interesting.
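One caveat worth checking: the merge above matches each code to the nearest range1 at or below it and never looks at range2. The sketch below adds a range2 check after the merge. It rebuilds the sample frames so it runs on its own, and the B70 row is an added, hypothetical test case that lies above B60 and so should get no category.

```python
from struct import pack, unpack

import pandas as pd


def encode(row, code_label="code", date_label="date"):
    flag = row[date_label] >= "20220101"
    return unpack(">I", pack("?3s", flag, row[code_label].encode("ascii")))[0]


df1 = pd.DataFrame({
    "code": ["A51", "C52", "A51", "D89", "B70"],  # B70 is a hypothetical extra row
    "date": ["20221215", "20221215", "20211215", "20211215", "20211215"],
})
df2 = pd.DataFrame({
    "range1": ["A00", "C50", "A00", "D00"],
    "range2": ["B99", "C60", "B60", "D90"],
    "s_date": ["20220101", "20220101", "20210101", "20210101"],
    "category": ["A", "B", "AA", "DD"],
})

df1["encoding"] = df1.apply(encode, axis="columns")
df2["encoding"] = df2.apply(encode, axis="columns",
                            code_label="range1", date_label="s_date")

merged = pd.merge_asof(df1.sort_values("encoding"),
                       df2.sort_values("encoding"),
                       on="encoding")

# merge_asof only compared against range1, so blank out any category
# whose code lies beyond the matched row's range2.
within = merged["code"] <= merged["range2"]
merged["category"] = merged["category"].where(within)

print(merged[["code", "date", "category"]])
```

A code that sorts below every range1 for its period would need a similar check against s_date, since it could otherwise match a row from the earlier period.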
