
PySpark: Join dataframe column based on array_contains

I have two dataframes:

sdf1 = spark.createDataFrame([
    ("123", "A", [1, 2, 3]),
    ("123","B", [4, 5]),
    ("456","C", [1, 2]),
    ("456","D", [3, 4, 5]),
], ["id1", "name", "resources"])

sdf2 = spark.createDataFrame([
    ("123", 1, "R1"),
    ("123", 2, "R2"),
    ("123", 3, "R3"),
    ("123", 4, "R4"),
    ("123", 5, "R5"),
    ("456", 1, "R1"),
    ("456", 2, "R2"),
    ("456", 3, "R7"),
    ("456", 4, "R8"),
    ("456", 5, "R9")
], ["id2", "resource_id", "name"])

Expected result:

+----+-----+-----------+-------------+
|id1 |name |resources  |New Column   |
+----+-----+-----------+-------------+
|123 |A    |[1, 2, 3]  |[R1, R2, R3] |
|123 |B    |[4, 5]     |[R4, R5]     |
|456 |C    |[1, 2]     |[R1, R2]     |
|456 |D    |[3, 4, 5]  |[R7, R8, R9] |
+----+-----+-----------+-------------+

I tried it this way:

res_sdf = sdf1.join(sdf2, on=[(sdf1.id1 == sdf2.id2) & array_contains(sdf1.resources, sdf2.resource_id)], how='left')

But I get this error: TypeError: Column is not iterable

What is the correct way to do this?

Thanks!

Try this code, which joins on the id columns and uses a UDF to test array membership:

    from pyspark.sql.functions import udf, collect_list
    from pyspark.sql.types import BooleanType

    # UDF returning a boolean: is resource_id contained in the resources array?
    contain_udf = udf(lambda x, y: x in y, BooleanType())

    # Join on the ids, then keep only rows whose resource_id is in resources
    res_sdf = sdf1.join(sdf2, on=[sdf1.id1 == sdf2.id2], how='left') \
        .filter(contain_udf("resource_id", "resources"))
    # Collect the matching resource names into one array per row of sdf1
    res_sdf = res_sdf.groupBy(sdf1.id1, sdf1.name, "resources") \
        .agg(collect_list(sdf2.name).alias("New Column")).orderBy("id1")
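
A UDF-free alternative (a sketch, assuming a Spark version whose SQL array_contains accepts a column as its value argument; in Spark 3.0+ the Python array_contains function also accepts a Column directly) is to push the membership test into the join condition with expr, avoiding the Python UDF overhead:

    from pyspark.sql import functions as F

    # array_contains(resources, resource_id) is evaluated as a SQL expression,
    # so the value argument can be a column rather than a literal
    cond = (sdf1.id1 == sdf2.id2) & F.expr("array_contains(resources, resource_id)")
    res_sdf = (sdf1.join(sdf2, on=cond, how="left")
               .groupBy(sdf1.id1, sdf1.name, "resources")
               .agg(F.collect_list(sdf2.name).alias("New Column"))
               .orderBy("id1"))

Note that collect_list does not guarantee element order; wrap it as F.sort_array(F.collect_list(...)) if the output must match the sorted ordering shown in the expected result.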
