Java Spark remove duplicates/nulls and preserve order
I have the following Java Spark dataset/dataframe.
Col_1 Col_2 Col_3 ...
A 1 1
A 1 NULL
B 2 2
B 2 3
C 1 NULL
There are close to 25 columns in this dataset, and I have to remove the records that are duplicated on Col_1. If one of the duplicate records contains NULL, the NULL record has to be removed (as for Col_1 = A). If there are multiple valid values (as for Col_1 = B), exactly one record, Col_2 = 2 and Col_3 = 2, should be retained every time. If there is only one record and it contains NULL (as for Col_1 = C), it has to be retained.
Expected Output:
Col_1 Col_2 Col_3 ...
A 1 1
B 2 2
C 1 NULL
What I tried so far:
I tried using groupBy with collect_set, sort_array and array_remove, but that removes the nulls altogether, even when the group has only one row.
How can I achieve the expected output in Java Spark?
This is how you can do it using Spark DataFrames:
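import org.apache.spark.sql.functions.{coalesce, col, lit, min, struct}
// toDF needs the session implicits; spark-shell imports them already.
// Assumes a SparkSession named `spark` is in scope.
import spark.implicits._

val rows = Seq(
    ("A", 1, Some(1)),
    ("A", 1, Option.empty[Int]),
    ("B", 2, Some(2)),
    ("B", 2, Some(3)),
    ("C", 1, Option.empty[Int]))
  .toDF("Col_1", "Col_2", "Col_3")

rows.show()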
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
|    A|    1|    1|
|    A|    1| null|
|    B|    2|    2|
|    B|    2|    3|
|    C|    1| null|
+-----+-----+-----+
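val deduped = rows.groupBy(col("Col_1"))
  .agg(
    min(
      struct(
        // Sort key: nulls in Col_3 become Int.MaxValue so they sort last
        coalesce(col("Col_3"), lit(Int.MaxValue)).as("null_maxed"),
        col("Col_2"),
        col("Col_3"))).as("argmax"))
  // Pull the original columns back out of the winning struct
  .select(col("Col_1"), col("argmax.Col_2"), col("argmax.Col_3"))

deduped.show()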
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
|    B|    2|    2|
|    C|    1| null|
|    A|    1|    1|
+-----+-----+-----+
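To see why the minimum lands on the non-null row, the struct comparison used in the aggregation above can be mimicked in plain Java without a cluster. This is only a sketch: the `StructMinSketch` and `Rec` names are hypothetical, and it assumes, as the Scala code does, that Col_3 never legitimately holds `Integer.MAX_VALUE`.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;

class StructMinSketch {
    // Hypothetical stand-in for one row (Col_1, Col_2, Col_3); col3 may be null.
    static final class Rec {
        final String col1;
        final int col2;
        final Integer col3;

        Rec(String col1, int col2, Integer col3) {
            this.col1 = col1;
            this.col2 = col2;
            this.col3 = col3;
        }
    }

    // Keeps, per col1, the row with the smallest (null_maxed, col2) key,
    // where null_maxed is col3 with null replaced by Integer.MAX_VALUE.
    static Map<String, Rec> dedupe(List<Rec> rows) {
        Comparator<Rec> order = Comparator
                .comparingInt((Rec r) -> r.col3 == null ? Integer.MAX_VALUE : r.col3)
                .thenComparingInt(r -> r.col2);
        return rows.stream().collect(Collectors.toMap(
                r -> r.col1,                  // group key
                r -> r,                       // candidate row
                BinaryOperator.minBy(order),  // keep the smaller of two candidates
                TreeMap::new));               // sorted by key for stable printing
    }

    public static void main(String[] args) {
        List<Rec> rows = Arrays.asList(
                new Rec("A", 1, 1),
                new Rec("A", 1, null),
                new Rec("B", 2, 2),
                new Rec("B", 2, 3),
                new Rec("C", 1, null));
        for (Rec r : dedupe(rows).values()) {
            System.out.println(r.col1 + " " + r.col2 + " " + r.col3);
        }
    }
}
```

For the sample data this keeps (A, 1, 1), (B, 2, 2) and (C, 1, NULL): a null row only wins when its group has no non-null row to beat it.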
What's happening here is that you group by Col_1 and then take the minimum of a composite struct built from Col_3 and Col_2. Structs are compared field by field, and nulls in Col_3 have been replaced with the maximum integer value in the first field, so a null row sorts last and is only picked when its group has no non-null row. We then select the original Col_2 and Col_3 from the resulting struct. I realise this is in Scala, but the syntax for Java should be very similar.
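Since the question asks for Java, a rough translation of the Scala snippet might look like the following. It is a sketch rather than a tested drop-in: the class name `DedupOnCol1`, the local `SparkSession` and the hand-built schema are assumptions made for illustration, and it inherits the assumption that Col_3 never contains `Integer.MAX_VALUE`.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.min;
import static org.apache.spark.sql.functions.struct;

public class DedupOnCol1 {
    public static void main(String[] args) {
        // Assumption: a local session, just for trying the example out.
        SparkSession spark = SparkSession.builder()
                .appName("DedupOnCol1")
                .master("local[*]")
                .getOrCreate();

        // Col_3 is the only nullable column in the sample data.
        StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("Col_1", DataTypes.StringType, false),
                DataTypes.createStructField("Col_2", DataTypes.IntegerType, false),
                DataTypes.createStructField("Col_3", DataTypes.IntegerType, true)));

        List<Row> data = Arrays.asList(
                RowFactory.create("A", 1, 1),
                RowFactory.create("A", 1, null),
                RowFactory.create("B", 2, 2),
                RowFactory.create("B", 2, 3),
                RowFactory.create("C", 1, null));

        Dataset<Row> rows = spark.createDataFrame(data, schema);

        Dataset<Row> deduped = rows.groupBy(col("Col_1"))
                .agg(min(struct(
                        // Sort key: nulls in Col_3 become Integer.MAX_VALUE so they sort last
                        coalesce(col("Col_3"), lit(Integer.MAX_VALUE)).as("null_maxed"),
                        col("Col_2"),
                        col("Col_3"))).as("argmax"))
                // Pull the original columns back out of the winning struct
                .select(col("Col_1"), col("argmax.Col_2"), col("argmax.Col_3"));

        deduped.show();
        spark.stop();
    }
}
```

For the full dataset of about 25 columns, the remaining columns could be appended to the struct after the sort keys and selected back out in the same way. Row order after `groupBy` is not guaranteed, so add an `orderBy` on Col_1 if the output order matters.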