get the distinct elements of an ArrayType column in a spark dataframe

I have a dataframe with 3 columns named id, feat1 and feat2. feat1 and feat2 are arrays of strings:

id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]

I want to get the list of distinct elements inside each feature column, so the output will be:

distinct_feat1, distinct_feat2
------------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"], ["feat2_1","feat2_2","feat2_3"]

What is the best way to do this in Scala?

You can use collect_set to find the distinct values of each column after applying the explode function to unnest the array elements in each cell. Suppose your data frame is called df:

import org.apache.spark.sql.functions._

// Explode each array column into one row per element, then collect
// the distinct values of each column with collect_set.
val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
                     withColumn("feat2", explode(col("feat2"))).
                     agg(collect_set("feat1").alias("distinct_feat1"),
                         collect_set("feat2").alias("distinct_feat2"))

distinct_df.show
+--------------------+--------------------+
|      distinct_feat1|      distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+


distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
                                                WrappedArray(, feat2_1, feat2_2, feat2_3)])
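One caveat (my addition, not from the original answer): explode drops rows whose array is empty or null, so the other column's values in those rows are lost as well. A minimal variant using explode_outer (available since Spark 2.2) keeps such rows; collect_set ignores the resulting nulls:

// Sketch assuming the same df as above: explode_outer emits a null row
// for empty/null arrays instead of dropping the row.
val distinct_df_outer = df.withColumn("feat1", explode_outer(col("feat1"))).
                           withColumn("feat2", explode_outer(col("feat2"))).
                           agg(collect_set("feat1").alias("distinct_feat1"),
                               collect_set("feat2").alias("distinct_feat2"))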

The method provided by Psidom works great; here is a function that does the same thing given a DataFrame and a list of fields:

def array_unique_values(df, fields):
    from pyspark.sql.functions import col, collect_set, explode
    from functools import reduce
    # Explode each listed array column, then collect its distinct values.
    data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
    return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])

And then:

data = array_unique_values(df, my_fields)
data.take(1)
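If you need the same helper in Scala (a sketch of mine, not part of the original answers), the fold-then-aggregate shape carries over directly:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_set, explode}

// Explode every listed array column, then collect each column's distinct values.
def arrayUniqueValues(df: DataFrame, fields: Seq[String]): DataFrame = {
  val exploded = fields.foldLeft(df)((d, f) => d.withColumn(f, explode(col(f))))
  val aggs = fields.map(f => collect_set(f).alias(f + "_distinct"))
  exploded.agg(aggs.head, aggs.tail: _*)
}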

One more solution for Spark 2.4+:

df.withColumn("distinct", array_distinct(concat($"array_col1", $"array_col2")))

Beware: if one of the columns is null, the result will be null.
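A possible guard (my sketch, not from the answer): coalesce each array column to an empty array before concatenating, so a null in either column no longer nulls out the whole result. The column names here are the answer's placeholders:

import org.apache.spark.sql.functions.{array_distinct, coalesce, concat, typedLit}

// Replace null arrays with empty string arrays before concatenating.
df.withColumn("distinct",
  array_distinct(concat(
    coalesce($"array_col1", typedLit(Seq.empty[String])),
    coalesce($"array_col2", typedLit(Seq.empty[String])))))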
