How can I do conditional joins in python like subquery?
I have 2 dfs.
I want to get df2's category, but there are no common keys.
df1's code should be between range1 and range2.
df2's s_date has only 2 values (20210101, 20220101).
df1

code | date
---|---
A51 | 20221215
C52 | 20221215
A51 | 20211215
D89 | 20211215
df2

range1 | range2 | s_date | category
---|---|---|---
A00 | B99 | 20220101 | A
C50 | C60 | 20220101 | B
A00 | B60 | 20210101 | AA
D00 | D90 | 20210101 | DD
For the output, I did this work using an SQL statement like the one below, but I want to do it in Python (pandas).
output

code | date | category
---|---|---
A51 | 20221215 | A
C52 | 20221215 | B
A51 | 20211215 | AA
D89 | 20211215 | DD
SELECT
  code, date,
  CASE WHEN date >= '20220101'
       THEN (SELECT category
             FROM df2
             WHERE code BETWEEN range1 AND range2
               AND s_date = '20220101')
       ELSE (SELECT category
             FROM df2
             WHERE code BETWEEN range1 AND range2
               AND s_date = '20210101') END AS category
FROM df1
Thanks.
You can directly compare strings in Python. For example, 'A50' > 'A00' will result in True, while 'A50' > 'B00' will return False. We can use this to filter df2 by these values.
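A minimal illustration of the lexicographic comparisons this relies on (editor's sketch; the chained form mirrors SQL's inclusive BETWEEN):

```python
# Python compares strings lexicographically, character by character
assert 'A50' > 'A00'          # '5' > '0' at index 1
assert not ('A50' > 'B00')    # 'A' < 'B' at index 0

# A chained comparison gives a BETWEEN-style range check on string codes
code = 'A51'
assert 'A00' <= code <= 'B99'
print('all comparisons hold')
```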
Here is how it might look in code:
categories = []
for _, row in df1.iterrows():
    code = row['code']
    date = row['date']
    # SQL BETWEEN is inclusive, so use <= on both ends
    if date >= '20220101':
        match = df2[(df2['range1'] <= code) & (code <= df2['range2']) & (df2['s_date'] == '20220101')]
    else:
        match = df2[(df2['range1'] <= code) & (code <= df2['range2']) & (df2['s_date'] == '20210101')]
    categories.append(match['category'].iloc[0])
df1['category'] = categories
There is perhaps a much more elegant approach, but the main pattern is like that.
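For instance, a vectorized sketch of the same pattern (editor's example, rebuilding the question's frames inline; `how='cross'` requires pandas >= 1.2): cross-join the two frames, then keep only rows whose code falls inside [range1, range2] and whose s_date matches the date bucket.

```python
import numpy as np
import pandas as pd

# Example frames from the question
df1 = pd.DataFrame({'code': ['A51', 'C52', 'A51', 'D89'],
                    'date': ['20221215', '20221215', '20211215', '20211215']})
df2 = pd.DataFrame({'range1': ['A00', 'C50', 'A00', 'D00'],
                    'range2': ['B99', 'C60', 'B60', 'D90'],
                    's_date': ['20220101', '20220101', '20210101', '20210101'],
                    'category': ['A', 'B', 'AA', 'DD']})

# Pair every df1 row with every df2 row, then filter
merged = df1.merge(df2, how='cross')

# Which s_date bucket each date belongs to
bucket = np.where(merged['date'] >= '20220101', '20220101', '20210101')

in_range = (merged['range1'] <= merged['code']) & (merged['code'] <= merged['range2'])
out = merged[in_range & (merged['s_date'] == bucket)]
out = out[['code', 'date', 'category']].reset_index(drop=True)
print(out)
```

This avoids the Python-level loop, at the cost of materializing the full cross product, so it suits small lookup tables like df2 here.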
`date` and `code` impose an ordering on the rows. One approach is to define an integer encoding of these columns, then use `merge_asof` to combine the dataframes.
We'll define an encoding function which assumes `code`s are guaranteed to be three ASCII characters (see comment).
from struct import pack, unpack

def encode(row, code_label='code', date_label='date'):
    # Flag byte: 1 if the date falls in the 20220101 bucket, else 0
    date_is_ge_20220101 = row[date_label] >= '20220101'
    # Next three bytes: the code's characters (assumed exactly 3 ASCII chars)
    code_as_bytes = bytes(row[code_label], encoding='ascii')
    # Pack flag + 3 chars into 4 bytes, reinterpret as a big-endian uint32
    return unpack('>I', pack('?3s', date_is_ge_20220101, code_as_bytes))[0]
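A quick sanity check of what the packing produces (editor's sketch; the function is repeated so the snippet runs on its own, and the sample rows are plain dicts standing in for dataframe rows):

```python
from struct import pack, unpack

def encode(row, code_label='code', date_label='date'):
    # Flag byte (date bucket) followed by the 3 ASCII code characters,
    # read back as one big-endian 32-bit integer
    date_is_ge_20220101 = row[date_label] >= '20220101'
    code_as_bytes = bytes(row[code_label], encoding='ascii')
    return unpack('>I', pack('?3s', date_is_ge_20220101, code_as_bytes))[0]

# Same code, different date buckets: the flag byte dominates the ordering
early = encode({'code': 'A51', 'date': '20211215'})  # flag 0
late = encode({'code': 'A51', 'date': '20221215'})   # flag 1
assert early == int.from_bytes(b'\x00A51', 'big')
assert late - early == 1 << 24  # the bucket flag is the most significant byte
```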
Perform the encoding...
df1['encoding'] = df1.apply(encode, axis='columns')
df2['encoding'] = df2.apply(encode, axis='columns',
                            code_label='range1',
                            date_label='s_date')
Do the merge, keeping in mind that `merge_asof` requires the merge values to be sorted.
pd.merge_asof(df1.sort_values('encoding'),
              df2.sort_values('encoding'),
              on='encoding')
giving...
code date encoding range1 range2 s_date category
0 A51 20211215 4273457 A00 B60 20210101 AA
1 D89 20211215 4470841 D00 D90 20210101 DD
2 A51 20221215 21050673 A00 B99 20220101 A
3 C52 20221215 21181746 C50 C60 20220101 B
For the record, I would only take this approach if absolutely necessary to meet a project's requirements. It is too far removed from the original problem statement, albeit interesting.