How do I convert a Spark DataFrame column from a vector to a set?

Question

I need to process a dataset to identify frequent itemsets, so the input column must be a vector. The original column is a string with the items separated by commas, so I did the following:

functions.split(out_1['skills'], ',')

The problem is that, for some rows, I have duplicated values in skills, and this causes an error when trying to identify the frequent itemsets.

I wanted to convert the vector to a set to remove the duplicated elements. Something like this:

functions.to_set(functions.split(out_1['skills'], ','))

But I could not find a function to convert a column from a vector to a set, i.e., there is no to_set function.

How can I accomplish what I want, i.e., remove the duplicated elements from the vector?

Answer 1

You can turn Python's built-in set function into a UDF with functions.udf(set) and then apply it to the array column:

df.show()
+-------+
| skills|
+-------+
|a,a,b,c|
|  a,b,c|
|c,d,e,e|
+-------+

import pyspark.sql.functions as F
df.withColumn("unique_skills", F.udf(set)(F.split(df.skills, ","))).show()
+-------+-------------+
| skills|unique_skills|
+-------+-------------+
|a,a,b,c|    [a, b, c]|
|  a,b,c|    [a, b, c]|
|c,d,e,e|    [c, d, e]|
+-------+-------------+
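
One caveat worth checking before passing this column on: when no return type is given, F.udf declares the result as a string, so unique_skills above may hold the text "[a, b, c]" rather than a real array. The following is a minimal sketch of a variant that declares an array return type. The helper names unique_items and with_unique_skills are hypothetical, not part of PySpark, and the column names are taken from the example above.

```python
def unique_items(items):
    """Return the items without duplicates, keeping first-seen order.

    None is passed through unchanged, because F.split yields null for null input.
    """
    if items is None:
        return None
    # dict.fromkeys keeps insertion order (Python 3.7+), unlike set()
    return list(dict.fromkeys(items))


def with_unique_skills(df, source_col="skills", target_col="unique_skills"):
    """Hypothetical helper: add a de-duplicated array<string> column to df."""
    # pyspark is imported here so unique_items can be unit-tested without Spark
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Declare the return type explicitly so the new column is an array, not a string
    unique_udf = F.udf(unique_items, ArrayType(StringType()))
    return df.withColumn(target_col, unique_udf(F.split(df[source_col], ",")))
```

Calling with_unique_skills(df).printSchema() should then report unique_skills as array<string>. Using dict.fromkeys instead of set is a design choice made here so that the element order stays stable between runs.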

Answer 2

When possible, prefer native Spark functions over UDFs for efficiency. There is a dedicated function that keeps only the unique items in an array column: array_distinct(), introduced in Spark 2.4.0.

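from pyspark.sql import Row, SparkSession
import pyspark.sql.functions as F

# Reuse the active session or start one (pyspark.shell is meant for the interactive shell)
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(skills='a,a,b,c'),
    Row(skills='a,b,c'),
    Row(skills='c,d,e,e'),
])

# Split the string into an array, then drop duplicate elements
df = df.withColumn('skills_arr', F.array_distinct(F.split(df.skills, ",")))
df.show(truncate=False)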

Result:

+-------+----------+
|skills |skills_arr|
+-------+----------+
|a,a,b,c|[a, b, c] |
|a,b,c  |[a, b, c] |
|c,d,e,e|[c, d, e] |
+-------+----------+

Answer 1
2 2017-10-08 18:18:35

Answer 2
0 2021-11-30 15:09:08
