Join tables in PySpark using a conditional join condition

Question

I have two tables I want to join:

Table X:

country	city	user
USA	Boston	David
USA	Miami	John
France	Paris	Peter

Table Y:

country	detail	value	id
USA	city	Boston	1
USA	null	null	2
France	null	null	3

And this is the output I want:

country	id	city	user
USA	1	Boston	David
USA	2	null	David
USA	2	null	John
France	3	null	Peter

The way I get this in SQL is:

select country, id, city, user
from X
join Y 
     on x.country = y.country
     and if(y.detail='city', x.city=y.value, TRUE)

How can I get the same result in PySpark?

Answer 1

You can do so with the code below. Note, however, that I had to select y.value and alias it to city in order to get your example output.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse the active session, or create one if none exists
spark = SparkSession.builder.getOrCreate()

d1 = [
    ('USA', 'Boston', 'David'),
    ('USA', 'Miami', 'John'),
    ('France', 'Paris', 'Peter')
]

d2 = [
    ('USA', 'city', 'Boston', 1),
    ('USA', None, None, 2),
    ('France', None, None, 3)
]

x = spark.createDataFrame(d1, ['country', 'city', 'user'])
y = spark.createDataFrame(d2, ['country', 'detail', 'value', 'id'])

# Always match on country; compare cities only when y.detail is 'city'
cond = (x.country == y.country) & (
    F.when(y.detail == 'city', x.city == y.value).otherwise(F.lit(True))
)

x.join(y, on=cond).select(x.country, y.id, y.value.alias('city'), x.user).orderBy('id').show()

+-------+---+------+-----+
|country| id|  city| user|
+-------+---+------+-----+
|    USA|  1|Boston|David|
|    USA|  2|  null|David|
|    USA|  2|  null| John|
| France|  3|  null|Peter|
+-------+---+------+-----+
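
To sanity-check the join condition without a Spark session, the same logic can be sketched in plain Python. This is only an illustration: the helper name `conditional_join` and the tuple layout are assumptions, and it does not model SQL NULL comparison semantics beyond what this sample data needs.

```python
# Plain-Python sketch of the conditional join; no Spark required.
X = [
    ("USA", "Boston", "David"),
    ("USA", "Miami", "John"),
    ("France", "Paris", "Peter"),
]
Y = [
    ("USA", "city", "Boston", 1),
    ("USA", None, None, 2),
    ("France", None, None, 3),
]


def conditional_join(x_rows, y_rows):
    """Hypothetical helper: nested-loop join mirroring the SQL ON clause."""
    result = []
    for x_country, x_city, x_user in x_rows:
        for y_country, y_detail, y_value, y_id in y_rows:
            # The country must always match.
            if x_country != y_country:
                continue
            # The city is compared only when the Y row has detail == 'city'.
            if y_detail == "city" and x_city != y_value:
                continue
            # Emit (country, id, city taken from y.value, user).
            result.append((x_country, y_id, y_value, x_user))
    # Order by id, like orderBy('id') in the PySpark version.
    return sorted(result, key=lambda row: row[1])


rows = conditional_join(X, Y)
for row in rows:
    print(row)
```

The rows it produces correspond to the table shown by `show()` above, with `None` standing in for `null`.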
