
How to merge rows in a Spark dataset to combine a string column

I need to merge two or more rows in a dataset into one. The grouping has to be done based on an id column. The column to be merged is a string, and I need to get a comma-separated string in the merged column. How do I achieve this in Java?

Input rows:

col1,col2  
1,abc  
2,pqr  
1,abc1  
3,xyz
2,pqr1

Expected output:

col1, col2  
1, "abc,abc1"  
2, "pqr,pqr1"  
3, xyz  
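For intuition, the transformation being asked for is a group-by on `col1` followed by a comma-join of the `col2` values. Before looking at the Spark answers, here is a minimal plain-Java sketch of that same logic using streams (no Spark involved; the class and method names are illustrative, not from the original post):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MergeRows {
    // Group rows by their first element (the id) and join the
    // second elements (the string column) with commas.
    static Map<String, String> merge(List<String[]> rows) {
        return rows.stream().collect(
            Collectors.groupingBy(r -> r[0], LinkedHashMap::new,
                Collectors.mapping(r -> r[1], Collectors.joining(","))));
    }

    public static void main(String[] args) {
        // The example input rows from the question.
        List<String[]> rows = List.of(
            new String[]{"1", "abc"},
            new String[]{"2", "pqr"},
            new String[]{"1", "abc1"},
            new String[]{"3", "xyz"},
            new String[]{"2", "pqr1"});
        merge(rows).forEach((k, v) -> System.out.println(k + ", " + v));
    }
}
```

This mirrors what the Spark `groupBy` + `collect_list` + `concat_ws` answers below compute, just on an in-memory list.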

To aggregate two separate columns:

your_data_frame
    .withColumn("aggregated_column", concat_ws(",", col("col1"), col("col2")))

Just in case, here is what to import besides the usual stuff:

import static org.apache.spark.sql.functions.*;
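For intuition: within each row, `concat_ws` joins the given column values with the separator (it also skips nulls). Plain Java's `String.join` does the analogous thing for non-null strings, sketched here with the example values (class name is illustrative):

```java
public class ConcatWsDemo {
    public static void main(String[] args) {
        // What concat_ws(",", col("col1"), col("col2")) computes for a row
        // whose values are "1" and "abc". Note that concat_ws additionally
        // skips null values, which String.join does not.
        String joined = String.join(",", "1", "abc");
        System.out.println(joined);
    }
}
```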

Edit

If you want to aggregate an arbitrary number of columns that you know by name, you can do it this way:

String[] column_names = {"c1", "c2", "c3"};
Column[] columns = Arrays.stream(column_names)
        .map(x -> col(x))
        .toArray(Column[]::new);
data_frame
    .withColumn("agg", concat_ws(",", columns));

Edit #2: group by and concat

In case you want to group by a column "ID" and aggregate another column, you can do it this way:

dataframe
    .groupBy("ID")
    .agg(concat_ws(",", collect_list(col("col1"))))

Use groupBy and concat_ws:

import org.apache.spark.sql.functions._
df.groupBy("col1").agg(concat_ws(",", collect_list("col2")))

