How can I do conditional joins in Python, like a SQL subquery?

Question

I have two DataFrames.

I want to get df2's category for each row of df1, but the two have no common key.

df1's code should fall between range1 and range2.

df2's s_date has only two values (20210101 and 20220101).

df1

code	date
A51	20221215
C52	20221215
A51	20211215
D89	20211215

df2

range1	range2	s_date	category
A00	B99	20220101	A
C50	C60	20220101	B
A00	B60	20210101	AA
D00	D90	20210101	DD

I produced the output below with the SQL statement that follows it,

but I want to do the same thing in Python (pandas).

output

code	date	category
A51	20221215	A
C52	20221215	B
A51	20211215	AA
D89	20211215	DD

SELECT 
 code, date,
 CASE WHEN date >= '20220101'
 THEN (SELECT category
      FROM df2
      WHERE code BETWEEN range1 AND range2
      AND   s_date = '20220101')
 ELSE (SELECT category
      FROM df2
      WHERE code BETWEEN range1 AND range2
      AND   s_date = '20210101') END AS category
FROM df1

Thanks.

Answer 1

You can directly compare strings in Python. For example, 'A50' > 'A00' evaluates to True, while 'A50' > 'B00' evaluates to False. We can use this to filter df2 by these values.

Here is how it might look in code:
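As a minimal sketch of that comparison rule: Python compares strings character by character, so a range check on codes works as long as every code has the same width and case. The helper name in_range is hypothetical and only used for illustration.

```python
def in_range(code, low, high):
    """Inclusive lexicographic range check, like SQL's BETWEEN."""
    return low <= code <= high

print(in_range("A51", "A00", "B99"))  # A51 sorts between A00 and B99
print(in_range("C52", "A00", "B99"))  # C52 sorts after B99

# Caveat: the comparison is not numeric, so codes of different
# lengths can order unexpectedly ("9" sorts after "1").
print("A9" > "A10")
```

This only mirrors the SQL BETWEEN if the database collation also orders these codes lexicographically, which is assumed here.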

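categories = []
for _, row in df1.iterrows():  # iterrows yields rows that can be indexed by column name
    code = row['code']
    # Pick the df2 s_date that applies to this row (dates are assumed to be strings)
    s_date = '20220101' if row['date'] >= '20220101' else '20210101'
    # Inclusive bounds, matching SQL's BETWEEN
    match = df2[(df2['range1'] <= code) & (code <= df2['range2']) & (df2['s_date'] == s_date)]['category']
    # Keep the first matching category, or None when no range contains the code
    categories.append(match.iloc[0] if not match.empty else None)
df1['category'] = categories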

There is probably a more elegant approach, but this is the main pattern.
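One possibly more elegant variant, sketched here under the assumption that all columns hold strings as in the question: derive the applicable s_date for each df1 row, do an ordinary equality merge on it, then keep only the rows where code falls inside the range. The intermediate names keyed, merged and in_range are illustrative, not from the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    "code": ["A51", "C52", "A51", "D89"],
    "date": ["20221215", "20221215", "20211215", "20211215"],
})
df2 = pd.DataFrame({
    "range1": ["A00", "C50", "A00", "D00"],
    "range2": ["B99", "C60", "B60", "D90"],
    "s_date": ["20220101", "20220101", "20210101", "20210101"],
    "category": ["A", "B", "AA", "DD"],
})

# Choose which s_date each df1 row should be matched against.
keyed = df1.assign(
    s_date=df1["date"].map(lambda d: "20220101" if d >= "20220101" else "20210101")
)

# Equality join on s_date; reset_index keeps df1's row labels in an "index" column.
merged = keyed.reset_index().merge(df2, on="s_date", how="left")

# Keep only candidate rows whose code lies in [range1, range2] (inclusive).
in_range = merged["code"].between(merged["range1"], merged["range2"])
category = (
    merged[in_range]
    .drop_duplicates("index")        # first match wins if ranges overlap
    .set_index("index")["category"]
)

# Assignment aligns on df1's index; rows with no matching range get NaN.
df1["category"] = category
print(df1)
```

The merge produces one candidate row per df1 row and df2 range with the same s_date, so this trades memory for speed; for a small df2 that is usually acceptable.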

Answer 2

date and code impose an ordering on the rows. One approach is to define an integer encoding of these columns, then use merge_asof to combine the dataframes.

We'll define an encoding function that assumes codes are guaranteed to be three ASCII characters (see comment).

from struct import pack, unpack

def encode(row, code_label='code', date_label='date'):
    date_is_ge_20220101 = row[date_label] >= '20220101'
    code_as_bytes = bytes(row[code_label], encoding='ascii')
    return unpack('>I', pack('?3s', date_is_ge_20220101, code_as_bytes))[0]

Perform the encoding...

df1['encoding'] = df1.apply(encode, axis='columns')
df2['encoding'] = df2.apply(encode, axis='columns',
                                    code_label='range1',
                                    date_label='s_date')

Do the merge keeping in mind that merge_asof requires the merge values to be sorted.

pd.merge_asof(df1.sort_values('encoding'),
              df2.sort_values('encoding'),
              on='encoding')

giving...

    code date     encoding  range1  range2  s_date   category
0   A51  20211215 4273457   A00     B60     20210101 AA
1   D89  20211215 4470841   D00     D90     20210101 DD
2   A51  20221215 21050673  A00     B99     20220101 A
3   C52  20221215 21181746  C50     C60     20220101 B

For the record, I would only take this approach if absolutely necessary to meet a project's requirements. It is too far removed from the original problem statement, albeit interesting.
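One thing worth checking with this approach: merge_asof only finds the nearest range1 at or below each code and never looks at range2. A hedged sketch of a post-merge check follows; the extra row with code B70 is a made-up example that is not in the question's data, and the variable outside is an illustrative name.

```python
from struct import pack, unpack

import pandas as pd

def encode(row, code_label="code", date_label="date"):
    # One flag byte for the date era, followed by the three code bytes.
    date_is_ge_20220101 = row[date_label] >= "20220101"
    code_as_bytes = bytes(row[code_label], encoding="ascii")
    return unpack(">I", pack("?3s", date_is_ge_20220101, code_as_bytes))[0]

df1 = pd.DataFrame({
    # "B70" is a hypothetical code that falls outside every 2021 range.
    "code": ["A51", "C52", "A51", "D89", "B70"],
    "date": ["20221215", "20221215", "20211215", "20211215", "20211215"],
})
df2 = pd.DataFrame({
    "range1": ["A00", "C50", "A00", "D00"],
    "range2": ["B99", "C60", "B60", "D90"],
    "s_date": ["20220101", "20220101", "20210101", "20210101"],
    "category": ["A", "B", "AA", "DD"],
})

df1["encoding"] = df1.apply(encode, axis="columns")
df2["encoding"] = df2.apply(encode, axis="columns",
                            code_label="range1", date_label="s_date")

merged = pd.merge_asof(df1.sort_values("encoding"),
                       df2.sort_values("encoding"),
                       on="encoding")

# Blank out categories where the code lies above the matched range's upper bound.
outside = merged["code"] > merged["range2"]
merged["category"] = merged["category"].where(~outside)
print(merged[["code", "date", "range1", "range2", "category"]])
```

Without the final check, B70 would silently pick up category AA from the A00 to B60 range.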
