
Remove duplicates within Spark array column

I have the following Dataset:

+-------------------+--------------------+
|               date|            products|
+-------------------+--------------------+
|2017-08-31 22:00:00|[361, 361, 361, 3...|
|2017-09-22 22:00:00|[361, 362, 362, 3...|
|2017-09-21 22:00:00|[361, 361, 361, 3...|
|2017-09-28 22:00:00|[360, 361, 361, 3...|

where the products column is an array of strings with possibly duplicated items.

I would like to remove these duplicates (within a single row).

What I did was basically write a UDF like this:

 import org.apache.spark.sql.functions.udf
 import scala.collection.mutable.WrappedArray

 val removeDuplicates: WrappedArray[String] => WrappedArray[String] = _.distinct
 val udfremoveDuplicates = udf(removeDuplicates)
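
For reference, the UDF is then applied like this (a sketch, assuming the DataFrame is named df):

 import org.apache.spark.sql.functions.col

 df.withColumn("rm_duplicates", udfremoveDuplicates(col("products")))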

This solution gives me proper results:

+-------------------+--------------------+--------------------+
|               date|            products|       rm_duplicates|
+-------------------+--------------------+--------------------+
|2017-08-31 22:00:00|[361, 361, 361, 3...|[361, 362, 363, 3...|
|2017-09-22 22:00:00|[361, 362, 362, 3...|[361, 362, 363, 3...|

My questions are:

  1. Does Spark provide a better/more efficient way of getting this result?

  2. I was thinking about using map, but how do I get the desired column as a List so that I can use the distinct method as in my removeDuplicates lambda?

Edit: I tagged this question with java because it does not matter to me whether I get an answer in Scala or Java :) Edit2: typos

The approach presented in the question (using a UDF) is the best approach, as spark-sql has no built-in primitive to uniquify arrays.

If you are dealing with massive amounts of data and/or the array values have unique properties, then it's worth thinking about the implementation of the UDF.

WrappedArray.distinct builds a mutable.HashSet behind the scenes and then traverses it to build the array of distinct elements. There are two possible problems with this from a performance standpoint:

  1. Scala's mutable collections are not wonderfully efficient, which is why in the guts of Spark you'll find a lot of Java collections and while loops. If you are in need of extreme performance, you can implement your own generic distinct using faster data structures.

  2. A generic implementation of distinct does not take advantage of any properties of your data. For example, if the arrays will be small on average, then a simple implementation that builds directly into an array and does a linear search for duplicates may perform much better than code that builds a complex data structure, despite its theoretical O(n^2) complexity (see the sketch after this list). For another example, if the values can only be numbers in a small range, or strings from a small set, you can implement uniquification via a bit set.
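
As a concrete illustration of the linear-search idea in point 2, here is a minimal sketch (the name dedupSmall is hypothetical; the assumption is that the arrays are short enough that an O(n^2) scan beats building a hash set):

 import org.apache.spark.sql.functions.udf
 import scala.collection.mutable.ArrayBuffer

 // Hypothetical sketch: build the result directly into a buffer and check
 // for duplicates with a linear scan; for small n this avoids HashSet overhead.
 val dedupSmall = udf { (xs: Seq[String]) =>
   if (xs == null) null
   else {
     val out = ArrayBuffer.empty[String]
     xs.foreach(x => if (!out.contains(x)) out += x) // linear duplicate check
     out.toSeq
   }
 }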

Again, these strategies should only be considered if you have ridiculous amounts of data. Your simple implementation is perfectly suitable for almost every situation.

The answers are out of date now, hence this newer answer.

With the Spark 2.4 array functions you can do something like this (some other aspects are shown as well, but one can get the gist of it):

val res4 = res3.withColumn("_f", array_distinct(sort_array(flatten($"_e"))))
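
The res3 and _e names belong to that answer's larger pipeline. A minimal self-contained sketch of the key built-in here, array_distinct (available since Spark 2.4), applied to this question's data (assuming an active SparkSession named spark):

 import org.apache.spark.sql.functions.array_distinct
 import spark.implicits._

 val df = Seq(
   ("2017-08-31 22:00:00", Seq("361", "361", "361", "362")),
   ("2017-09-22 22:00:00", Seq("361", "362", "362", "363"))
 ).toDF("date", "products")

 // array_distinct removes duplicates within each row's array natively,
 // avoiding UDF serialization overhead entirely.
 df.withColumn("rm_duplicates", array_distinct($"products")).show(false)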

BTW a good read here: https://www.waitingforcode.com/apache-spark-sql/apache-spark-2.4.0-features-array-higher-order-functions/read

You can use a simple UDF (here colName stands in for your array column, i.e. products):

 val dedup = udf((colName: scala.collection.mutable.WrappedArray[String]) => colName.distinct)

 df.withColumn("DeDupColumn", dedup($"colName"))

Given your current dataframe schema:

root
 |-- date: string (nullable = true)
 |-- products: array (nullable = true)
 |    |-- element: integer (containsNull = false)

You can use the following method to remove the duplicates.

 import scala.collection.mutable

 df.map { row =>
   val products = row(1).asInstanceOf[mutable.WrappedArray[Int]]
   DuplicateRemoved(row(0).toString, products, products.distinct)
 }.toDF()

Of course, you need a case class for this:

case class DuplicateRemoved(date: String, products: mutable.WrappedArray[Int], rm_duplicates: mutable.WrappedArray[Int])

You should get the following output:

+-------------------+------------------------------+-------------------------+
|date               |products                      |rm_duplicates            |
+-------------------+------------------------------+-------------------------+
|2017-08-31 22:00:00|[361, 361, 361, 362, 363, 364]|[361, 362, 363, 364]     |
|2017-09-22 22:00:00|[361, 362, 362, 362, 363, 364]|[361, 362, 363, 364]     |
|2017-09-21 22:00:00|[361, 361, 361, 362, 363, 364]|[361, 362, 363, 364]     |
|2017-09-28 22:00:00|[360, 361, 361, 362, 363, 364]|[360, 361, 362, 363, 364]|
+-------------------+------------------------------+-------------------------+

I hope this answer is helpful.
