
Replace values of each array in pyspark dataframe array column by their corresponding ids

I have a pyspark.sql dataframe that looks like this:

id  name  refs
1   A     B, C, D
2   B     A
3   C     A, B

I'm trying to build a function that replaces the values of each array in refs by the corresponding id of the name it references; if there's no matching name in the name column, it would ideally filter that value out or set it to null. The results would ideally look something like this:

id  name  refs
1   A     2, 3
2   B     1
3   C     1, 2

I tried doing this by defining a UDF that collects all names from the table and then obtains the indices of the intersection between each refs array and the set of all names. It works but is extremely slow; I'm sure there are better ways to do this using Spark and/or SQL.
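For comparison, the slow approach described above might look roughly like the sketch below. This is an illustration only: the name_to_id mapping and the refs_to_ids helper are hypothetical names, and a dict lookup stands in for the intersection-index logic the question mentions.

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, LongType

# Pull the entire name -> id mapping to the driver; it gets serialized
# into the UDF's closure, and every row is then processed in Python one
# at a time, which is why this approach is slow.
name_to_id = dict(df.select('name', 'id').collect())

@F.udf(returnType=ArrayType(LongType()))
def refs_to_ids(refs):
    # Replace each matching ref by its id; drop refs with no match
    return [name_to_id[r] for r in refs if r in name_to_id]

slow_result = df.withColumn('refs', refs_to_ids('refs'))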

You can explode the arrays, do a self-join using the exploded refs and name, and collect the joined ids back into an array using collect_list.

import pyspark.sql.functions as F

# Explode refs so each referenced name gets its own row
df1 = df.select('id', 'name', F.explode('refs').alias('refs'))
# Rename columns to avoid ambiguity in the self-join
df2 = df.toDF('id2', 'name2', 'refs2')

# Match each exploded ref to the row whose name it references,
# then collect the matched ids back into an array per original row
result = df1.join(df2, df1.refs == df2.name2) \
            .select('id', 'name', 'id2') \
            .groupBy('id', 'name') \
            .agg(F.collect_list('id2').alias('refs'))

result.show()
+---+----+------+
| id|name|  refs|
+---+----+------+
|  1|   A|[2, 3]|
|  2|   B|   [1]|
|  3|   C|[1, 2]|
+---+----+------+
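For reference, here is a minimal end-to-end sketch of the same idea. The SparkSession setup and the 'left' join type are additions, not part of the original answer; the left join addresses the question's "filter that value out" case while also keeping rows whose refs have no matches at all.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question
df = spark.createDataFrame(
    [(1, 'A', ['B', 'C', 'D']), (2, 'B', ['A']), (3, 'C', ['A', 'B'])],
    ['id', 'name', 'refs'],
)

df1 = df.select('id', 'name', F.explode('refs').alias('refs'))
df2 = df.toDF('id2', 'name2', 'refs2')

# A left join keeps every exploded ref even when no name matches;
# collect_list ignores nulls, so the unmatched 'D' is dropped from
# row 1's array, and a row whose refs all missed would still appear
# with an empty array (an inner join would drop such a row entirely).
result = df1.join(df2, df1.refs == df2.name2, 'left') \
            .groupBy('id', 'name') \
            .agg(F.collect_list('id2').alias('refs'))
result.show()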
