How to merge rows in a Spark Dataset to combine a string column
I need to merge two or more rows in a Dataset into one. The grouping has to be done based on an id column. The column to be merged is a string. I need to get a comma-separated string in the merged column. How do I achieve this in Java?

Input rows:
col1,col2
1,abc
2,pqr
1,abc1
3,xyz
2,pqr1
Expected output:
col1, col2
1, "abc,abc1"
2, "pqr,pqr1"
3, xyz
To aggregate two separate columns:
your_data_frame
    .withColumn("aggregated_column", concat_ws(",", col("col1"), col("col2")))
Just in case, here is what to import besides the usual stuff:
import static org.apache.spark.sql.functions.*;
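For reference, here is a minimal self-contained sketch of the same idea; the session setup, schema, and sample values are assumptions made up for illustration:

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ConcatTwoColumnsExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("concat-two-columns")
                .master("local[*]")          // local run, for illustration only
                .getOrCreate();

        // Sample data; the column names match the snippet above
        List<Row> rows = Arrays.asList(
                RowFactory.create("abc", "abc1"),
                RowFactory.create("pqr", "pqr1"));
        StructType schema = new StructType()
                .add("col1", DataTypes.StringType)
                .add("col2", DataTypes.StringType);
        Dataset<Row> df = spark.createDataFrame(rows, schema);

        // concat_ws joins the columns with the separator and skips nulls
        df.withColumn("aggregated_column", concat_ws(",", col("col1"), col("col2")))
          .show(false);

        spark.stop();
    }
}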
Edit
If you want to aggregate an arbitrary number of columns that you know by name, you can do it this way:
String[] column_names = {"c1", "c2", "c3"};
Column[] columns = Arrays.asList(column_names)
        .stream().map(x -> col(x))
        .collect(Collectors.toList())
        .toArray(new Column[0]);
data_frame
        .withColumn("agg", concat_ws(",", columns));
Edit #2: group by and concat
In case you want to group by a column "ID" and aggregate another column, you can do it this way:
dataframe
    .groupBy("ID")
    .agg(concat_ws(",", collect_list(col("col1"))))
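Put together, here is a self-contained sketch that reproduces the expected output from the question; the DataFrame construction and the .as("col2") alias are assumptions for illustration:

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class GroupAndConcatExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("group-and-concat")
                .master("local[*]")
                .getOrCreate();

        // The input rows from the question
        List<Row> rows = Arrays.asList(
                RowFactory.create(1, "abc"),
                RowFactory.create(2, "pqr"),
                RowFactory.create(1, "abc1"),
                RowFactory.create(3, "xyz"),
                RowFactory.create(2, "pqr1"));
        StructType schema = new StructType()
                .add("col1", DataTypes.IntegerType)
                .add("col2", DataTypes.StringType);
        Dataset<Row> df = spark.createDataFrame(rows, schema);

        // Group by col1 and join all col2 values of each group with a comma
        df.groupBy("col1")
          .agg(concat_ws(",", collect_list(col("col2"))).as("col2"))
          .orderBy("col1")
          .show(false);
        // col1 | col2
        // 1    | abc,abc1
        // 2    | pqr,pqr1
        // 3    | xyz

        spark.stop();
    }
}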
Use groupBy and concat_ws:
import org.apache.spark.sql.functions._
df.groupBy("col1").agg(concat_ws(",", collect_list("col2")))