
Replace values of each array in pyspark dataframe array column by their corresponding ids

I have a pyspark.sql dataframe that looks like this:

id  name  refs
1   A     B, C, D
2   B     A
3   C     A, B

I'm trying to build a function that replaces the values of each array in refs by the corresponding id of the name it references; if there's no matching name in the name column, it would ideally filter that value out or set it to null. The results would ideally look something like this:

id  name  refs
1   A     2, 3
2   B     1
3   C     1, 2

I tried doing this by defining a UDF that collects all names from the table and then obtains the indices of the intersection between each refs array and the set of all names. It works but is extremely slow; I'm sure there are better ways to do this using Spark and/or SQL.
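For comparison, the slow approach described above might look roughly like the sketch below. This is an illustration only: the name_to_id mapping and the refs_to_ids helper are hypothetical names, and a dict lookup stands in for the intersection-index logic the question mentions.

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, LongType

# Pull the entire name -> id mapping to the driver; it gets serialized
# into the UDF's closure, and every row is then processed in Python one
# at a time, which is why this approach is slow.
name_to_id = dict(df.select('name', 'id').collect())

@F.udf(returnType=ArrayType(LongType()))
def refs_to_ids(refs):
    # Replace each matching ref by its id; drop refs with no match
    return [name_to_id[r] for r in refs if r in name_to_id]

slow_result = df.withColumn('refs', refs_to_ids('refs'))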

You can explode the arrays, do a self-join using the exploded refs and name, and collect the joined ids back into an array using collect_list.

import pyspark.sql.functions as F

# Explode refs so each referenced name gets its own row
df1 = df.select('id', 'name', F.explode('refs').alias('refs'))
# Rename columns to avoid ambiguity in the self-join
df2 = df.toDF('id2', 'name2', 'refs2')

# Match each exploded ref to the row whose name it references,
# then collect the matched ids back into an array per original row
result = df1.join(df2, df1.refs == df2.name2) \
            .select('id', 'name', 'id2') \
            .groupBy('id', 'name') \
            .agg(F.collect_list('id2').alias('refs'))

result.show()
+---+----+------+
| id|name|  refs|
+---+----+------+
|  1|   A|[2, 3]|
|  2|   B|   [1]|
|  3|   C|[1, 2]|
+---+----+------+
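For reference, here is a minimal end-to-end sketch of the same idea. The SparkSession setup and the 'left' join type are additions, not part of the original answer; the left join addresses the question's "filter that value out" case while also keeping rows whose refs have no matches at all.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question
df = spark.createDataFrame(
    [(1, 'A', ['B', 'C', 'D']), (2, 'B', ['A']), (3, 'C', ['A', 'B'])],
    ['id', 'name', 'refs'],
)

df1 = df.select('id', 'name', F.explode('refs').alias('refs'))
df2 = df.toDF('id2', 'name2', 'refs2')

# A left join keeps every exploded ref even when no name matches;
# collect_list ignores nulls, so the unmatched 'D' is dropped from
# row 1's array, and a row whose refs all missed would still appear
# with an empty array (an inner join would drop such a row entirely).
result = df1.join(df2, df1.refs == df2.name2, 'left') \
            .groupBy('id', 'name') \
            .agg(F.collect_list('id2').alias('refs'))
result.show()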
