Pyspark - Join with null values in right dataset
Let's say I have a dataset with the following:
# dataset_left
#+-----------------+--------------+---------------+
#| A | B | C |
#+-----------------+--------------+---------------+
#| some_value_1 | some_value_3 | some_value_5 |
#+-----------------+--------------+---------------+
#| some_value_2 | some_value_4 | some_value_6 |
#+-----------------+--------------+---------------+
I also have another dataset like the following:
# dataset_rules
#+-----------------+--------------+---------------+
#| A | B | result_col |
#+-----------------+--------------+---------------+
#| null | some_value_3 | result_1 |
#+-----------------+--------------+---------------+
#| some_value_2 | null | result_2 |
#+-----------------+--------------+---------------+
My goal is to join the two datasets with this rule: in dataset_rules, null values in column A and column B can match any value from dataset_left. The join should only take into account the non-null values from dataset_rules.
So for the 1st row in dataset_rules, only column B should be used as a condition. And for the 2nd row, only column A should be used as a condition.
I want to achieve the following desired result:
# dataset_result
#+-----------------+--------------+---------------+------------+
#| A | B | C | result_col |
#+-----------------+--------------+---------------+------------+
#| some_value_1 | some_value_3 | some_value_5 | result_1 |
#+-----------------+--------------+---------------+------------+
#| some_value_2 | some_value_4 | some_value_6 | result_2 |
#+-----------------+--------------+---------------+------------+
The goal is to avoid hard-coding the rules from dataset_rules, so that new rules are easy to add and the join stays maintainable.
You can join using a when or coalesce expression like this:
from pyspark.sql import functions as F

# A null rule value falls back to the left value via coalesce, so the
# comparison becomes trivially true and the null acts as a wildcard.
join_cond = (
    (F.coalesce(dataset_rules["A"], dataset_left["A"]) == dataset_left["A"])
    & (F.coalesce(dataset_rules["B"], dataset_left["B"]) == dataset_left["B"])
)

result = dataset_left.join(dataset_rules, join_cond, "left").select(
    dataset_left["*"],
    dataset_rules["result_col"],
)
result.show()
#+------------+------------+------------+----------+
#| A| B| C|result_col|
#+------------+------------+------------+----------+
#|some_value_1|some_value_3|some_value_5| result_1|
#|some_value_2|some_value_4|some_value_6| result_2|
#+------------+------------+------------+----------+
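The null-as-wildcard behavior of coalesce can be illustrated outside Spark with plain Python: when a rule value is None, falling back to the row's own value makes the equality trivially true, so each rule only constrains its non-null columns. This is a standalone sketch of the logic, not Spark code; `rule_matches` and the literal dictionaries below are made-up stand-ins for the DataFrames.

```python
def rule_matches(rule, row):
    """Mirror coalesce(rule_col, left_col) == left_col for each key column.

    A None rule value falls back to the row's own value, so the comparison
    always succeeds and the null behaves as a wildcard.
    """
    return all(
        (rule[col] if rule[col] is not None else row[col]) == row[col]
        for col in ("A", "B")
    )

rows = [
    {"A": "some_value_1", "B": "some_value_3", "C": "some_value_5"},
    {"A": "some_value_2", "B": "some_value_4", "C": "some_value_6"},
]
rules = [
    {"A": None, "B": "some_value_3", "result_col": "result_1"},
    {"A": "some_value_2", "B": None, "result_col": "result_2"},
]

# Left join: keep every row, attach result_col from each matching rule.
result = [
    {**row, "result_col": rule["result_col"]}
    for row in rows
    for rule in rules
    if rule_matches(rule, row)
]
for r in result:
    print(r)
```

The first row matches only the first rule (A is a wildcard, B is equal), and the second row matches only the second rule, reproducing the desired dataset_result above.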