
Java Spark remove duplicates/nulls and preserve order

I have the below Java Spark dataset/dataframe.

Col_1 Col_2 Col_3 ...
A     1     1
A     1     NULL
B     2     2
B     2     3
C     1     NULL

There are close to 25 columns in this dataset and I have to remove the records that are duplicated on Col_1. If the duplicate row is NULL, then the NULL row has to be removed (as in the case of Col_1 = A). If there are multiple valid values, as in the case of Col_1 = B, then only one of them (Col_2 = 2 and Col_3 = 2) should be retained. If there is only one record and it is NULL, as in the case of Col_1 = C, then it has to be retained.

Expected Output:

Col_1 Col_2 Col_3 ...
A     1     1
B     2     2
C     1     NULL

What I tried so far:

I tried using group by and collect_set with sort_array and array_remove, but it removes the nulls altogether, even when the group has only a single (null) row.
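For reference, a minimal Java sketch of that kind of attempt (assuming rows is a Dataset<Row> built from the sample data above) shows why the null for Col_1 = C gets lost: collect_set ignores NULL values, so a group whose only Col_3 is NULL ends up with an empty array.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.collect_set;
import static org.apache.spark.sql.functions.sort_array;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sketch of the collect_set-based attempt; `rows` is assumed to hold the sample data.
Dataset<Row> attempted = rows
    .groupBy(col("Col_1"))
    // collect_set ignores NULLs, so Col_1 = C (whose only Col_3 is NULL)
    // produces an empty array and the NULL cannot be recovered afterwards.
    .agg(sort_array(collect_set(col("Col_3"))).as("col3_values"));
attempted.show();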

How can I achieve the expected output in Java Spark?

This is how you can do it using Spark dataframes:

import org.apache.spark.sql.functions.{coalesce, col, lit, min, struct}
import spark.implicits._ // needed for .toDF on a local Seq (spark is the SparkSession); already in scope in spark-shell

val rows = Seq(
  ("A",1,Some(1)),
  ("A",1, Option.empty[Int]),
  ("B",2,Some(2)),
  ("B",2,Some(3)),
  ("C",1,Option.empty[Int]))
  .toDF("Col_1", "Col_2", "Col_3")

rows.show()

+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
|    A|    1|    1|
|    A|    1| null|
|    B|    2|    2|
|    B|    2|    3|
|    C|    1| null|
+-----+-----+-----+

val deduped = rows.groupBy(col("Col_1"))
  .agg(
    min(
      struct(
        // map NULL to Int.MaxValue so rows with a non-null Col_3 sort first
        coalesce(col("Col_3"), lit(Int.MaxValue)).as("null_maxed"),
        col("Col_2"),
        col("Col_3"))).as("argmax"))
  // pull the original Col_2 and Col_3 back out of the winning struct
  .select(col("Col_1"), col("argmax.Col_2"), col("argmax.Col_3"))

deduped.show()

+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
|    B|    2|    2|
|    C|    1| null| 
|    A|    1|    1|
+-----+-----+-----+

What's happening here is that you group by Col_1 and then take the minimum of a composite struct of Col_3 and Col_2, but nulls in Col_3 have been replaced with the max integer value so they don't impact the ordering. We then select the original Col_3 and Col_2 from the resulting row. I realise this is in Scala, but the syntax for Java should be very similar.
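For completeness, here is a rough Java translation of the same approach. It's a sketch assuming a local SparkSession and the sample data from the question; the class name and schema construction are illustrative, not part of the original answer.

import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.min;
import static org.apache.spark.sql.functions.struct;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class DedupePreferNonNull {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("dedupe").master("local[*]").getOrCreate();

    // Sample data from the question; Col_3 is nullable.
    StructType schema = new StructType()
        .add("Col_1", DataTypes.StringType)
        .add("Col_2", DataTypes.IntegerType)
        .add("Col_3", DataTypes.IntegerType);
    Dataset<Row> rows = spark.createDataFrame(Arrays.asList(
        RowFactory.create("A", 1, 1),
        RowFactory.create("A", 1, null),
        RowFactory.create("B", 2, 2),
        RowFactory.create("B", 2, 3),
        RowFactory.create("C", 1, null)), schema);

    // Same idea as the Scala version: take the minimum of a struct whose first
    // field maps NULL to Integer.MAX_VALUE, so rows with a non-null Col_3 win.
    Dataset<Row> deduped = rows.groupBy(col("Col_1"))
        .agg(min(struct(
            coalesce(col("Col_3"), lit(Integer.MAX_VALUE)).as("null_maxed"),
            col("Col_2"),
            col("Col_3"))).as("argmax"))
        .select(col("Col_1"), col("argmax.Col_2"), col("argmax.Col_3"));

    deduped.show();
    spark.stop();
  }
}

The min over the struct compares fields left to right, so the null_maxed field decides first and a non-null Col_3 always beats a NULL one within a group; the original Col_2 and Col_3 ride along inside the struct and are selected back out at the end.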
