
Handling multiple records in dataframe - Java Spark

I have the following dataframe:

Column_1 Column_2
1        A
1        X
2        X
3        B
3        X
4        C
4        D

In the above dataframe, there can be multiple records for the same value in Column_1. I have to remove only those records which have more than one entry and have an X in Column_2. If Column_2 has two different values like C and D, I have to retain them. Only when a record has multiple entries and one of those entries is X do I have to remove it from my dataframe. Please note that if there is only one record with X in Column_2, then we should not remove that record.

Expected Output:

Column_1 Column_2
1        A
2        X
3        B
4        C
4        D

Kindly let me know if this can be achieved in Java Spark. I was able to remove the X records altogether, but I am not sure how to achieve the above.

Thank you.

Complete working code with explanation inline. The input CSV looks like:

Column_1,Column_2
1,A
1,X
2,X
3,B
3,X
4,C
4,D

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.*;

public class DropDups {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();
        Dataset<Row> ds = spark.read()
                .option("header", "true")
                .csv("src/main/resources/duplicateRec.csv");

        ds.show();

/* Outputs
+--------+--------+
|Column_1|Column_2|
+--------+--------+
|       1|       A|
|       1|       X|
|       2|       X|
|       3|       B|
|       3|       X|
|       4|       C|
|       4|       D|
+--------+--------+
 */

        //Group by Column_1 and collect set of elements from Column_2 and remove 'X' from the set
        ds = ds.groupBy(ds.col("Column_1")).agg(
                array_remove(collect_set(ds.col("Column_2")), lit("X")).as("Column_2_list"));

        // if the set is now empty (the group only contained "X"), put ["X"] back, else keep the actual set
        ds = ds.withColumn("Column_2_array",
                when(size(ds.col("Column_2_list")).equalTo(0), array(lit("X")))
                        .otherwise(ds.col("Column_2_list")));

        // Explode the array back into one row per value (restoring Column_2) and drop the helper columns
        ds.withColumn("Column_2", explode(ds.col("Column_2_array")))
                .drop("Column_2_list", "Column_2_array")
                .show();
/* Outputs
+--------+--------+
|Column_1|Column_2|
+--------+--------+
|       3|       B|
|       1|       A|
|       4|       C|
|       4|       D|
|       2|       X|
+--------+--------+
         */
    }
}
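
As a side note, the same rule can be expressed more directly with a window function instead of the groupBy/explode round-trip: count the rows in each Column_1 group and keep a row unless its group has more than one row and its Column_2 value is "X". A minimal sketch, assuming ds is the freshly loaded dataframe from above (before the groupBy):

import org.apache.spark.sql.expressions.Window;

import static org.apache.spark.sql.functions.*;

// count the rows in each Column_1 group
ds.withColumn("cnt", count(lit(1)).over(Window.partitionBy("Column_1")))
  // keep a row unless it is an "X" row in a group that has more than one row
  .where(not(col("Column_2").equalTo("X").and(col("cnt").gt(1))))
  .drop("cnt")
  .show();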

It's Scala, but the Java version will look almost identical. Number the rows within each c_1 group so that the "X" rows come last, then drop any "X" row that is not the first row of its group:

df.withColumn("id", row_number().over(Window.partitionBy("c_1").orderBy('c_2 === "X")))
  .where(!('c_2 === "X" and 'id > 1))
  .drop("id")

+----+----+
| c_1| c_2|
+----+----+
|   1|   A|
|   3|   B|
|   4|   C| 
|   4|   D|
|   2|   X|
+----+----+
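
A minimal Java sketch of the same window-based idea, using the question's Column_1/Column_2 names and assuming ds is the Dataset<Row> loaded in the first answer:

import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

import static org.apache.spark.sql.functions.*;

// number the rows within each Column_1 group so that "X" rows come last
WindowSpec w = Window.partitionBy("Column_1").orderBy(col("Column_2").equalTo("X"));

ds.withColumn("rn", row_number().over(w))
  // drop an "X" row only when it is not the first (i.e. not the only) row of its group
  .where(not(col("Column_2").equalTo("X").and(col("rn").gt(1))))
  .drop("rn")
  .show();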
