Remove duplicates within Spark array column

I have a given DataSet:

+-------------------+--------------------+
|               date|            products|
+-------------------+--------------------+
|2017-08-31 22:00:00|[361, 361, 361, 3...|
|2017-09-22 22:00:00|[361, 362, 362, 3...|
|2017-09-21 22:00:00|[361, 361, 361, 3...|
|2017-09-28 22:00:00|[360, 361, 361, 3...|

where the products column is an array of strings with possibly duplicated items.

I would like to remove these duplicates (within one row).

What I did was basically write a UDF like this:

 import scala.collection.mutable.WrappedArray
 import org.apache.spark.sql.functions.udf

 val removeDuplicates: WrappedArray[String] => WrappedArray[String] = _.distinct
 val udfremoveDuplicates = udf(removeDuplicates)
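
For reference, the UDF is applied roughly like this (a sketch, assuming the DataFrame is named df):

 df.withColumn("rm_duplicates", udfremoveDuplicates($"products"))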

This solution gives me proper results:

+-------------------+--------------------+--------------------+
|               date|            products|       rm_duplicates|
+-------------------+--------------------+--------------------+
|2017-08-31 22:00:00|[361, 361, 361, 3...|[361, 362, 363, 3...|
|2017-09-22 22:00:00|[361, 362, 362, 3...|[361, 362, 363, 3...|

My questions are:

  1. Does Spark provide a better/more efficient way of getting this result?

  2. I was thinking about using a map - but how do I get the desired column as a List so that I can use the 'distinct' method as in my removeDuplicates lambda?

Edit: I marked this topic with the java tag because it does not matter to me in which language (scala or java) I get an answer :) Edit2: typos

The approach presented in the question, using a UDF, is the best approach, as spark-sql has no built-in primitive to uniquify arrays.

If you are dealing with massive amounts of data and/or the array values have particular properties, then it's worth thinking about the implementation of the UDF.

WrappedArray.distinct builds a mutable.HashSet behind the scenes and then traverses it to build the array of distinct elements. There are two possible problems with this from a performance standpoint:

  1. Scala's mutable collections are not wonderfully efficient, which is why in the guts of Spark you'll find a lot of Java collections and while loops. If you are in need of extreme performance, you can implement your own generic distinct using faster data structures.

  2. A generic implementation of distinct does not take advantage of any properties of your data. For example, if the arrays will be small on average, then a simple implementation that builds directly into an array and does a linear search for duplicates may perform much better than code that builds a complex data structure, despite its theoretical O(n^2) complexity. For another example, if the values can only be numbers in a small range, or strings from a small set, you can implement uniquification via a bit set. (A minimal sketch of the linear-search idea follows this list.)
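
A minimal sketch of that linear-search approach, assuming the arrays are small on average and the DataFrame from the question is named df (the name smallArrayDistinct and the ArrayBuffer-based implementation are illustrative, not part of the original answer):

 import org.apache.spark.sql.functions.udf

 // Illustrative sketch: linear-search distinct, avoiding the mutable.HashSet
 // that WrappedArray.distinct builds; fine when arrays are small on average.
 val smallArrayDistinct = udf { (xs: Seq[String]) =>
   val out = scala.collection.mutable.ArrayBuffer.empty[String]
   var i = 0
   while (i < xs.length) {
     if (!out.contains(xs(i))) out += xs(i)  // O(n) scan per element
     i += 1
   }
   out.toSeq
 }

 df.withColumn("rm_duplicates", smallArrayDistinct($"products"))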

Again, these strategies should only be considered if you have ridiculous amounts of data. Your simple implementation is perfectly suitable for almost every situation.

The answers are out of date now, hence this newer answer.

With the Spark 2.4 array functions you can do something like this; some other aspects are shown as well, but one can get the gist of it:

val res4 = res3.withColumn("_f", array_distinct(sort_array(flatten($"_e"))))
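
For the DataFrame from the question, the same built-in function can be applied directly; a minimal sketch, assuming the DataFrame is named df:

 import org.apache.spark.sql.functions.array_distinct

 // Spark 2.4+ ships array_distinct, which removes duplicate elements within each array
 val deduped = df.withColumn("rm_duplicates", array_distinct($"products"))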

BTW a good read here: https://www.waitingforcode.com/apache-spark-sql/apache-spark-2.4.0-features-array-higher-order-functions/read

You can use a simple UDF.

import org.apache.spark.sql.functions.udf

val dedup = udf((colName: scala.collection.mutable.WrappedArray[String]) => colName.distinct)

df.withColumn("DeDupColumn", dedup($"colName"))

Given your current dataframe schema as

root
 |-- date: string (nullable = true)
 |-- products: array (nullable = true)
 |    |-- element: integer (containsNull = false)

You can use the following method for removing the duplicates.

df.map(row => DuplicateRemoved(row(0).toString, row(1).asInstanceOf[mutable.WrappedArray[Int]], row(1).asInstanceOf[mutable.WrappedArray[Int]].distinct)).toDF()

Of course you need a case class for this:

case class DuplicateRemoved(date: String, products: mutable.WrappedArray[Int], rm_duplicates: mutable.WrappedArray[Int])

You should be getting the following output:

+-------------------+------------------------------+-------------------------+
|date               |products                      |rm_duplicates            |
+-------------------+------------------------------+-------------------------+
|2017-08-31 22:00:00|[361, 361, 361, 362, 363, 364]|[361, 362, 363, 364]     |
|2017-09-22 22:00:00|[361, 362, 362, 362, 363, 364]|[361, 362, 363, 364]     |
|2017-09-21 22:00:00|[361, 361, 361, 362, 363, 364]|[361, 362, 363, 364]     |
|2017-09-28 22:00:00|[360, 361, 361, 362, 363, 364]|[360, 361, 362, 363, 364]|
+-------------------+------------------------------+-------------------------+

I hope the answer is helpful.
