Pyspark - Join with null values in right dataset
Let's say I have a dataset with the following:
# dataset_left
#+-----------------+--------------+---------------+
#| A | B | C |
#+-----------------+--------------+---------------+
#| some_value_1 | some_value_3 | some_value_5 |
#+-----------------+--------------+---------------+
#| some_value_2 | some_value_4 | some_value_6 |
#+-----------------+--------------+---------------+
I also have another dataset like the following:
# dataset_rules
#+-----------------+--------------+---------------+
#| A | B | result_col |
#+-----------------+--------------+---------------+
#| null | some_value_3 | result_1 |
#+-----------------+--------------+---------------+
#| some_value_2 | null | result_2 |
#+-----------------+--------------+---------------+
My goal is to join the two datasets with this rule: in dataset_rules, null values in column A and column B can match any value from dataset_left. The join should only take into account the non-null values from dataset_rules.
So for the 1st row in dataset_rules, only column B should be used as a condition. And for the 2nd row, only column A should be used as a condition.
I want to achieve the following desired result:
# dataset_result
#+-----------------+--------------+---------------+------------+
#| A | B | C | result_col |
#+-----------------+--------------+---------------+------------+
#| some_value_1 | some_value_3 | some_value_5 | result_1 |
#+-----------------+--------------+---------------+------------+
#| some_value_2 | some_value_4 | some_value_6 | result_2 |
#+-----------------+--------------+---------------+------------+
The goal is to avoid hard-coding the rules from dataset_rules, so that new rules are easy to add and the join stays maintainable.
You can join using a when or coalesce expression like this:
from pyspark.sql import functions as F

# A null rule value falls back to the left value via coalesce, so the
# comparison becomes trivially true and the null acts as a wildcard.
join_cond = (
    (F.coalesce(dataset_rules["A"], dataset_left["A"]) == dataset_left["A"])
    & (F.coalesce(dataset_rules["B"], dataset_left["B"]) == dataset_left["B"])
)

result = dataset_left.join(dataset_rules, join_cond, "left").select(
    dataset_left["*"],
    dataset_rules["result_col"],
)
result.show()
#+------------+------------+------------+----------+
#| A| B| C|result_col|
#+------------+------------+------------+----------+
#|some_value_1|some_value_3|some_value_5| result_1|
#|some_value_2|some_value_4|some_value_6| result_2|
#+------------+------------+------------+----------+
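The null-as-wildcard behavior of coalesce can be illustrated outside Spark with plain Python: when a rule value is None, falling back to the row's own value makes the equality trivially true, so each rule only constrains its non-null columns. This is a standalone sketch of the logic, not Spark code; `rule_matches` and the literal dictionaries below are made-up stand-ins for the DataFrames.

```python
def rule_matches(rule, row):
    """Mirror coalesce(rule_col, left_col) == left_col for each key column.

    A None rule value falls back to the row's own value, so the comparison
    always succeeds and the null behaves as a wildcard.
    """
    return all(
        (rule[col] if rule[col] is not None else row[col]) == row[col]
        for col in ("A", "B")
    )

rows = [
    {"A": "some_value_1", "B": "some_value_3", "C": "some_value_5"},
    {"A": "some_value_2", "B": "some_value_4", "C": "some_value_6"},
]
rules = [
    {"A": None, "B": "some_value_3", "result_col": "result_1"},
    {"A": "some_value_2", "B": None, "result_col": "result_2"},
]

# Left join: keep every row, attach result_col from each matching rule.
result = [
    {**row, "result_col": rule["result_col"]}
    for row in rows
    for rule in rules
    if rule_matches(rule, row)
]
for r in result:
    print(r)
```

The first row matches only the first rule (A is a wildcard, B is equal), and the second row matches only the second rule, reproducing the desired dataset_result above.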