
Get the distinct elements of an ArrayType column in a Spark DataFrame

I have a dataframe with 3 columns named id, feat1, and feat2. feat1 and feat2 are arrays of strings:

id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]

I want to get the list of distinct elements inside each feature column, so the output will be:

distinct_feat1,distinct_feat2
-----------------------------  
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3]

What is the best way to do this in Scala?

You can use collect_set to find the distinct values of each column after applying explode to unnest the array elements in each cell. Suppose your dataframe is called df:

import org.apache.spark.sql.functions._

// explode each array column into one row per element, then collect the distinct values
val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
                     withColumn("feat2", explode(col("feat2"))).
                     agg(collect_set("feat1").alias("distinct_feat1"),
                         collect_set("feat2").alias("distinct_feat2"))

distinct_df.show
+--------------------+--------------------+
|      distinct_feat1|      distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+


distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
                                                WrappedArray(, feat2_1, feat2_2, feat2_3)])

The method provided by Psidom works great. Here is a function that does the same thing given a DataFrame and a list of fields:

def array_unique_values(df, fields):
    from pyspark.sql.functions import col, collect_set, explode
    from functools import reduce
    # explode each of the given array columns, then collect the distinct values of each
    data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
    return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])

And then:

data = array_unique_values(df, my_fields)
data.take(1)

One more solution, for Spark 2.4+:

.withColumn("distinct", array_distinct(concat($"array_col1", $"array_col2")))

Beware: if one of the columns is null, the result will be null.
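
If the columns can contain nulls, one way to keep the result from becoming null is to coalesce each column to an empty array before concatenating. A minimal sketch, assuming both columns are array<string> and df is the input dataframe (array_col1 and array_col2 are the placeholder names from the snippet above):

import org.apache.spark.sql.functions._

// Substitute an empty array<string> for a null column value,
// so a null in either column no longer nulls out the concatenation.
val emptyArr = typedLit(Seq.empty[String])

val result = df.withColumn("distinct",
  array_distinct(concat(coalesce(col("array_col1"), emptyArr),
                        coalesce(col("array_col2"), emptyArr))))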
