I have two DataFrames.
I want to add df2's category to df1, but the two have no common key.
df1's code should fall between df2's range1 and range2.
df2's s_date has only two values (20210101 and 20220101).
df1
code | date |
---|---|
A51 | 20221215 |
C52 | 20221215 |
A51 | 20211215 |
D89 | 20211215 |
df2
range1 | range2 | s_date | category |
---|---|---|---|
A00 | B99 | 20220101 | A |
C50 | C60 | 20220101 | B |
A00 | B60 | 20210101 | AA |
D00 | D90 | 20210101 | DD |
I produced the output below with the SQL statement that follows it,
but I want to do the same in Python (pandas).
output
code | date | category |
---|---|---|
A51 | 20221215 | A |
C52 | 20221215 | B |
A51 | 20211215 | AA |
D89 | 20211215 | DD |
SELECT
    code, date,
    CASE WHEN date >= '20220101'
         THEN (SELECT category
               FROM df2
               WHERE code BETWEEN range1 AND range2
                 AND s_date = '20220101')
         ELSE (SELECT category
               FROM df2
               WHERE code BETWEEN range1 AND range2
                 AND s_date = '20210101') END AS category
FROM df1
Thanks.
You can compare strings directly in Python. For example, 'A50' > 'A00' evaluates to True, while 'A50' > 'B00' evaluates to False. We can use this to filter df2 by these values.
Here is how it might look in code:
categories = []
for _, row in df1.iterrows():
    code = row['code']
    date = row['date']
    # Select the df2 period for this date (dates are assumed to be strings).
    s_date = '20220101' if date >= '20220101' else '20210101'
    # SQL BETWEEN is inclusive, so compare with <= on both ends.
    match = df2[(df2['range1'] <= code) & (code <= df2['range2']) & (df2['s_date'] == s_date)]['category']
    # Store the matching scalar, or None when no range covers the code.
    categories.append(match.iloc[0] if not match.empty else None)
df1['category'] = categories
There may be a more elegant approach, but this is the main pattern.
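One possible way to express the same pattern without an explicit loop is to join on the period and then filter by range. This is only a sketch: it assumes the dates and codes are stored as strings, that only the two periods from the question exist, and the sample frames are rebuilt here from the question's tables.

```python
import pandas as pd

df1 = pd.DataFrame({
    "code": ["A51", "C52", "A51", "D89"],
    "date": ["20221215", "20221215", "20211215", "20211215"],
})
df2 = pd.DataFrame({
    "range1": ["A00", "C50", "A00", "D00"],
    "range2": ["B99", "C60", "B60", "D90"],
    "s_date": ["20220101", "20220101", "20210101", "20210101"],
    "category": ["A", "B", "AA", "DD"],
})

# Map each date to the s_date period it belongs to (only two periods exist).
left = df1.assign(
    s_date=df1["date"].map(lambda d: "20220101" if d >= "20220101" else "20210101")
)

# Pair every df1 row with every df2 row of the same period, keeping the
# original row position in a column named "index".
merged = left.reset_index().merge(df2, on="s_date", how="left")

# Keep only pairs where the code lies inside the range (inclusive, like BETWEEN).
in_range = merged[(merged["range1"] <= merged["code"]) & (merged["code"] <= merged["range2"])]

# Attach the category back to df1 by row position; unmatched codes get NaN.
# Overlapping ranges within one period would produce duplicate rows here.
result = df1.join(in_range.set_index("index")["category"])
print(result)
```

The intermediate frame has one row per (df1 row, df2 row of that period) pair, so this trades memory for brevity and suits a small df2 best.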
date and code impose an ordering on the rows. One approach is to define an integer encoding of these columns, then use merge_asof to combine the DataFrames.
We'll define an encoding function which assumes codes are guaranteed to be three ASCII characters (see comment).
from struct import pack, unpack

def encode(row, code_label='code', date_label='date'):
    date_is_ge_20220101 = row[date_label] >= '20220101'
    code_as_bytes = bytes(row[code_label], encoding='ascii')
    return unpack('>I', pack('?3s', date_is_ge_20220101, code_as_bytes))[0]
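To see why this encoding preserves the ordering, it may help to look at the bytes for a single value. The snippet below is an illustrative sketch using the code A51 from the question; it calls pack and unpack directly and is not part of the encode function itself.

```python
from struct import pack, unpack

# One flag byte (is the date on or after 20220101?) followed by the three code bytes.
packed_2021 = pack('?3s', False, b'A51')
packed_2022 = pack('?3s', True, b'A51')

# Read the four bytes as one big-endian unsigned integer.
value_2021 = unpack('>I', packed_2021)[0]
value_2022 = unpack('>I', packed_2022)[0]

print(packed_2021, value_2021)  # b'\x00A51' 4273457
print(packed_2022, value_2022)  # b'\x01A51' 21050673
```

Because the flag sits in the most significant byte, every row from the later period sorts after every row from the earlier one, and within a period the integers sort in the same order as the code strings.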
Perform the encoding...
df1['encoding'] = df1.apply(encode, axis='columns')
df2['encoding'] = df2.apply(encode, axis='columns',
                            code_label='range1',
                            date_label='s_date')
Do the merge, keeping in mind that merge_asof requires the merge values to be sorted.
pd.merge_asof(df1.sort_values('encoding'),
              df2.sort_values('encoding'),
              on='encoding')
giving...
code date encoding range1 range2 s_date category
0 A51 20211215 4273457 A00 B60 20210101 AA
1 D89 20211215 4470841 D00 D90 20210101 DD
2 A51 20221215 21050673 A00 B99 20220101 A
3 C52 20221215 21181746 C50 C60 20220101 B
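Note that merge_asof with its default backward direction only finds the nearest range1 at or below each code; it does not look at range2 or confirm that the matched row belongs to the same period. A possible follow-up check is sketched below. The extra row C70 is hypothetical, added only to show a code that falls outside every range, and the encode helper here takes plain values rather than a row.

```python
import pandas as pd
from struct import pack, unpack

def encode(code, date):
    # Same layout as above: period flag in the high byte, then the 3 code bytes.
    return unpack('>I', pack('?3s', date >= '20220101', code.encode('ascii')))[0]

df1 = pd.DataFrame({
    'code': ['A51', 'C52', 'A51', 'D89', 'C70'],  # 'C70' is a hypothetical extra row
    'date': ['20221215', '20221215', '20211215', '20211215', '20221215'],
})
df2 = pd.DataFrame({
    'range1': ['A00', 'C50', 'A00', 'D00'],
    'range2': ['B99', 'C60', 'B60', 'D90'],
    's_date': ['20220101', '20220101', '20210101', '20210101'],
    'category': ['A', 'B', 'AA', 'DD'],
})

df1['encoding'] = [encode(c, d) for c, d in zip(df1['code'], df1['date'])]
df2['encoding'] = [encode(c, d) for c, d in zip(df2['range1'], df2['s_date'])]

merged = pd.merge_asof(df1.sort_values('encoding'),
                       df2.sort_values('encoding'),
                       on='encoding')

# A match is valid only if the code is inside [range1, range2] and the
# matched df2 row belongs to the same period as the df1 date.
in_range = (merged['range1'] <= merged['code']) & (merged['code'] <= merged['range2'])
same_period = (merged['date'] >= '20220101') == (merged['s_date'] >= '20220101')
merged.loc[~(in_range & same_period), 'category'] = None

print(merged[['code', 'date', 'category']])
```

Here C70 is matched to the C50 row by merge_asof, but since C70 is above C60 its category is cleared.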
For the record, I would only take this approach if absolutely necessary to meet a project's requirements. It is too far removed from the original problem statement, albeit interesting.