How to merge rows in a Spark Dataset to combine a string column
I need to merge two or more rows in a Dataset into one. The grouping has to be done based on an id column. The column to be merged is a string. I need to get a comma-separated string in the merged column. How do I achieve this in Java?

Input rows:
col1,col2
1,abc
2,pqr
1,abc1
3,xyz
2,pqr1
Expected output:
col1, col2
1, "abc,abc1"
2, "pqr,pqr1"
3, xyz
To aggregate two separate columns:
your_data_frame
    .withColumn("aggregated_column", concat_ws(",", col("col1"), col("col2")))
Just in case, here is what to import besides the usual stuff:
import static org.apache.spark.sql.functions.*;
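For reference, here is a minimal self-contained sketch of the same idea; the session setup, schema, and sample values are assumptions made up for illustration:

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ConcatTwoColumnsExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("concat-two-columns")
                .master("local[*]")          // local run, for illustration only
                .getOrCreate();

        // Sample data; the column names match the snippet above
        List<Row> rows = Arrays.asList(
                RowFactory.create("abc", "abc1"),
                RowFactory.create("pqr", "pqr1"));
        StructType schema = new StructType()
                .add("col1", DataTypes.StringType)
                .add("col2", DataTypes.StringType);
        Dataset<Row> df = spark.createDataFrame(rows, schema);

        // concat_ws joins the columns with the separator and skips nulls
        df.withColumn("aggregated_column", concat_ws(",", col("col1"), col("col2")))
          .show(false);

        spark.stop();
    }
}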
Edit
If you want to aggregate an arbitrary number of columns that you know by name, you can do it this way:
String[] column_names = {"c1", "c2", "c3"};
Column[] columns = Arrays.asList(column_names)
        .stream().map(x -> col(x))
        .collect(Collectors.toList())
        .toArray(new Column[0]);
data_frame
        .withColumn("agg", concat_ws(",", columns));
Edit #2: group by and concat
In case you want to group by a column "ID" and aggregate another column, you can do it this way:
dataframe
    .groupBy("ID")
    .agg(concat_ws(",", collect_list(col("col1"))))
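Put together, here is a self-contained sketch that reproduces the expected output from the question; the DataFrame construction and the .as("col2") alias are assumptions for illustration:

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class GroupAndConcatExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("group-and-concat")
                .master("local[*]")
                .getOrCreate();

        // The input rows from the question
        List<Row> rows = Arrays.asList(
                RowFactory.create(1, "abc"),
                RowFactory.create(2, "pqr"),
                RowFactory.create(1, "abc1"),
                RowFactory.create(3, "xyz"),
                RowFactory.create(2, "pqr1"));
        StructType schema = new StructType()
                .add("col1", DataTypes.IntegerType)
                .add("col2", DataTypes.StringType);
        Dataset<Row> df = spark.createDataFrame(rows, schema);

        // Group by col1 and join all col2 values of each group with a comma
        df.groupBy("col1")
          .agg(concat_ws(",", collect_list(col("col2"))).as("col2"))
          .orderBy("col1")
          .show(false);
        // col1 | col2
        // 1    | abc,abc1
        // 2    | pqr,pqr1
        // 3    | xyz

        spark.stop();
    }
}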
Use groupBy and concat_ws:
import org.apache.spark.sql.functions._
df.groupBy("col1").agg(concat_ws(",", collect_list("col2")))