
Remove duplicates within Spark array column

I have the following Dataset:

+-------------------+--------------------+
|               date|            products|
+-------------------+--------------------+
|2017-08-31 22:00:00|[361, 361, 361, 3...|
|2017-09-22 22:00:00|[361, 362, 362, 3...|
|2017-09-21 22:00:00|[361, 361, 361, 3...|
|2017-09-28 22:00:00|[360, 361, 361, 3...|

where the products column is an array of strings with possibly duplicated items.

I would like to remove these duplicates (within a single row).

What I did was basically write a UDF like this:

 import org.apache.spark.sql.functions.udf
 import scala.collection.mutable.WrappedArray

 val removeDuplicates: WrappedArray[String] => WrappedArray[String] = _.distinct
 val udfremoveDuplicates = udf(removeDuplicates)
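
For reference, the UDF is then applied like this (a sketch, assuming the DataFrame is named df):

 import org.apache.spark.sql.functions.col

 df.withColumn("rm_duplicates", udfremoveDuplicates(col("products")))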

This solution gives me proper results:

+-------------------+--------------------+--------------------+
|               date|            products|       rm_duplicates|
+-------------------+--------------------+--------------------+
|2017-08-31 22:00:00|[361, 361, 361, 3...|[361, 362, 363, 3...|
|2017-09-22 22:00:00|[361, 362, 362, 3...|[361, 362, 363, 3...|

My questions are:

  1. Does Spark provide a better/more efficient way of getting this result?

  2. I was thinking about using map, but how do I get the desired column as a List so that I can use the distinct method as in my removeDuplicates lambda?

Edit: I tagged this question with java because it does not matter to me whether I get an answer in Scala or Java :) Edit2: typos

The approach presented in the question (using a UDF) is the best approach, as spark-sql has no built-in primitive to uniquify arrays.

If you are dealing with massive amounts of data and/or the array values have unique properties, then it's worth thinking about the implementation of the UDF.

WrappedArray.distinct builds a mutable.HashSet behind the scenes and then traverses it to build the array of distinct elements. There are two possible problems with this from a performance standpoint:

  1. Scala's mutable collections are not wonderfully efficient, which is why in the guts of Spark you'll find a lot of Java collections and while loops. If you are in need of extreme performance, you can implement your own generic distinct using faster data structures.

  2. A generic implementation of distinct does not take advantage of any properties of your data. For example, if the arrays will be small on average, then a simple implementation that builds directly into an array and does a linear search for duplicates may perform much better than code that builds a complex data structure, despite its theoretical O(n^2) complexity (see the sketch after this list). For another example, if the values can only be numbers in a small range, or strings from a small set, you can implement uniquification via a bit set.
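
As a concrete illustration of the linear-search idea in point 2, here is a minimal sketch (the name dedupSmall is hypothetical; the assumption is that the arrays are short enough that an O(n^2) scan beats building a hash set):

 import org.apache.spark.sql.functions.udf
 import scala.collection.mutable.ArrayBuffer

 // Hypothetical sketch: build the result directly into a buffer and check
 // for duplicates with a linear scan; for small n this avoids HashSet overhead.
 val dedupSmall = udf { (xs: Seq[String]) =>
   if (xs == null) null
   else {
     val out = ArrayBuffer.empty[String]
     xs.foreach(x => if (!out.contains(x)) out += x) // linear duplicate check
     out.toSeq
   }
 }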

Again, these strategies should only be considered if you have ridiculous amounts of data. Your simple implementation is perfectly suitable for almost every situation.

The answers are out of date now, hence this newer answer.

With the Spark 2.4 array functions you can do something like this (some other aspects are shown as well, but one can get the gist of it):

val res4 = res3.withColumn("_f", array_distinct(sort_array(flatten($"_e"))))
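
The res3 and _e names belong to that answer's larger pipeline. A minimal self-contained sketch of the key built-in here, array_distinct (available since Spark 2.4), applied to this question's data (assuming an active SparkSession named spark):

 import org.apache.spark.sql.functions.array_distinct
 import spark.implicits._

 val df = Seq(
   ("2017-08-31 22:00:00", Seq("361", "361", "361", "362")),
   ("2017-09-22 22:00:00", Seq("361", "362", "362", "363"))
 ).toDF("date", "products")

 // array_distinct removes duplicates within each row's array natively,
 // avoiding UDF serialization overhead entirely.
 df.withColumn("rm_duplicates", array_distinct($"products")).show(false)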

BTW a good read here: https://www.waitingforcode.com/apache-spark-sql/apache-spark-2.4.0-features-array-higher-order-functions/read

You can use a simple UDF (here colName stands in for your array column, i.e. products):

 val dedup = udf((colName: scala.collection.mutable.WrappedArray[String]) => colName.distinct)

 df.withColumn("DeDupColumn", dedup($"colName"))

Given your current dataframe schema:

root
 |-- date: string (nullable = true)
 |-- products: array (nullable = true)
 |    |-- element: integer (containsNull = false)

You can use the following method to remove the duplicates.

 import scala.collection.mutable

 df.map { row =>
   val products = row(1).asInstanceOf[mutable.WrappedArray[Int]]
   DuplicateRemoved(row(0).toString, products, products.distinct)
 }.toDF()

Of course, you need a case class for this:

case class DuplicateRemoved(date: String, products: mutable.WrappedArray[Int], rm_duplicates: mutable.WrappedArray[Int])

You should get the following output:

+-------------------+------------------------------+-------------------------+
|date               |products                      |rm_duplicates            |
+-------------------+------------------------------+-------------------------+
|2017-08-31 22:00:00|[361, 361, 361, 362, 363, 364]|[361, 362, 363, 364]     |
|2017-09-22 22:00:00|[361, 362, 362, 362, 363, 364]|[361, 362, 363, 364]     |
|2017-09-21 22:00:00|[361, 361, 361, 362, 363, 364]|[361, 362, 363, 364]     |
|2017-09-28 22:00:00|[360, 361, 361, 362, 363, 364]|[360, 361, 362, 363, 364]|
+-------------------+------------------------------+-------------------------+

I hope this answer is helpful.
