Handling multiple records in dataframe - Java Spark
I have the below dataframe:
Column_1 Column_2
1 A
1 X
2 X
3 B
3 X
4 C
4 D
In the above dataframe, there can be multiple records with the same value in column 1. I have to remove only those records which have more than one entry and have an X in column_2. If column 2 has 2 different values, like C and D, I have to retain them. Only when a record has multiple entries and one of those entries is X do I have to remove that X entry from my dataframe. Please note that if there is only one record with X in column_2, then we should not be removing that record.
Expected output:
Column_1 Column_2
1 A
2 X
3 B
4 C
4 D
Kindly let me know if this can be achieved in Java Spark. I was able to remove the X records altogether, but I am not sure how to achieve the above.
Thank you.
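To make the keep/drop rule concrete, here is a minimal plain-Java sketch of the same logic without Spark (class and method names are illustrative, not from any answer): keep a row unless its group has more than one entry and the row's value is "X".

```java
import java.util.*;
import java.util.stream.*;

public class FilterRule {
    // Keep a row unless its key appears more than once AND the row's value is "X".
    static List<String[]> filter(List<String[]> rows) {
        Map<String, Long> counts = rows.stream()
                .collect(Collectors.groupingBy(r -> r[0], Collectors.counting()));
        return rows.stream()
                .filter(r -> !(counts.get(r[0]) > 1 && r[1].equals("X")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"1", "A"}, new String[]{"1", "X"},
                new String[]{"2", "X"}, new String[]{"3", "B"},
                new String[]{"3", "X"}, new String[]{"4", "C"},
                new String[]{"4", "D"});
        for (String[] r : filter(rows)) {
            System.out.println(r[0] + " " + r[1]);
        }
        // Prints: 1 A, 2 X, 3 B, 4 C, 4 D (one pair per line)
    }
}
```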
Complete working code with explanation inline. The input CSV looks like:
Column_1,Column_2
1,A
1,X
2,X
3,B
3,X
4,C
4,D
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class DropDups {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();
        Dataset<Row> ds = spark.read()
                .option("header", "true")
                .csv("src/main/resources/duplicateRec.csv");
        ds.show();
        /* Outputs
        +--------+--------+
        |Column_1|Column_2|
        +--------+--------+
        |       1|       A|
        |       1|       X|
        |       2|       X|
        |       3|       B|
        |       3|       X|
        |       4|       C|
        |       4|       D|
        +--------+--------+
        */
        // Group by Column_1, collect the set of Column_2 values, and remove 'X' from each set
        ds = ds.groupBy(ds.col("Column_1")).agg(
                array_remove(collect_set(ds.col("Column_2")), lit("X")).as("Column_2_list"));
        // If the set is now empty (the group only contained 'X'), fall back to ["X"];
        // otherwise keep the actual set
        ds = ds.withColumn("Column_2_array",
                when(size(ds.col("Column_2_list")).equalTo(0), array(lit("X")))
                        .otherwise(ds.col("Column_2_list")));
        // Explode the array back into one row per value and drop the helper columns
        ds.withColumn("Column_2", explode(ds.col("Column_2_array")))
                .drop("Column_2_list", "Column_2_array")
                .show();
        /* Outputs
        +--------+--------+
        |Column_1|Column_2|
        +--------+--------+
        |       3|       B|
        |       1|       A|
        |       4|       C|
        |       4|       D|
        |       2|       X|
        +--------+--------+
        */
        spark.stop();
    }
}
It's Scala, but the Java version will look almost identical:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Order non-X rows first within each group: ordering by the partition key alone
// would make row_number nondeterministic, so an X row could get id 1 and survive.
df.withColumn("id", row_number().over(Window.partitionBy("c_1").orderBy('c_2 === "X")))
  .where(!('c_2 === "X" && 'id > 1))
  .drop("id")
  .show()
+----+----+
| c_1| c_2|
+----+----+
| 1| A|
| 3| B|
| 4| C|
| 4| D|
| 2| X|
+----+----+
Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.