
Get the distinct elements of an ArrayType column in a Spark DataFrame

I have a dataframe with 3 columns named id, feat1, and feat2. feat1 and feat2 are arrays of strings:

id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]

I want to get the list of distinct elements inside each feature column, so the output will be:

distinct_feat1,distinct_feat2
-----------------------------  
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3]

What is the best way to do this in Scala?

You can use collect_set to find the distinct values of each column after applying explode to unnest the array elements in each cell. Suppose your dataframe is called df:

import org.apache.spark.sql.functions._

// explode each array column into one row per element, then collect the distinct values
val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
                     withColumn("feat2", explode(col("feat2"))).
                     agg(collect_set("feat1").alias("distinct_feat1"),
                         collect_set("feat2").alias("distinct_feat2"))

distinct_df.show
+--------------------+--------------------+
|      distinct_feat1|      distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+


distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
                                                WrappedArray(, feat2_1, feat2_2, feat2_3)])

The method provided by Psidom works great. Here is a function that does the same thing given a DataFrame and a list of fields:

def array_unique_values(df, fields):
    from pyspark.sql.functions import col, collect_set, explode
    from functools import reduce
    # explode each of the given array columns, then collect the distinct values of each
    data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
    return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])

And then:

data = array_unique_values(df, my_fields)
data.take(1)

One more solution, for Spark 2.4+:

.withColumn("distinct", array_distinct(concat($"array_col1", $"array_col2")))

Beware: if one of the columns is null, the result will be null.
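
If the columns can contain nulls, one way to keep the result from becoming null is to coalesce each column to an empty array before concatenating. A minimal sketch, assuming both columns are array<string> and df is the input dataframe (array_col1 and array_col2 are the placeholder names from the snippet above):

import org.apache.spark.sql.functions._

// Substitute an empty array<string> for a null column value,
// so a null in either column no longer nulls out the concatenation.
val emptyArr = typedLit(Seq.empty[String])

val result = df.withColumn("distinct",
  array_distinct(concat(coalesce(col("array_col1"), emptyArr),
                        coalesce(col("array_col2"), emptyArr))))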
