
How to concat one column records in spark?

I have 2 columns and want to concatenate them; below is a small sample of the data:

ID         Comments

32412     CLOSE AS NORMAL
32412     UNDER REVIEW 

I want this to become as below, so basically the view groups by ID and concatenates the comments:

ID      Comments

32412  CLOSE AS NORMAL
       UNDER REVIEW

An alternative way to do this without using a SQL query:

import scala.collection.mutable
import org.apache.spark.sql.functions.{collect_list, udf}

// Join each grouped array of comments into a single string.
// Use "\n" instead of " " here if you want one comment per line.
val myUDF = udf[String, mutable.WrappedArray[String]](_.mkString(" "))
df.groupBy($"id")
  .agg(collect_list("comments").as("comments"))
  .withColumn("comments", myUDF($"comments"))
  .show()

This also requires the sqlContext to be a HiveContext.

You can use a UDF (user-defined function) for this. Assuming you have a DataFrame named df with the data, you can try something like this:

import scala.collection.mutable

// Register a UDF that joins an array of strings with newlines
sqlContext.udf.register("ArrayToString", (a: mutable.WrappedArray[String]) => a.mkString("\n"))
df.registerTempTable("IDsAndComments")
val new_df = sqlContext.sql(
  "WITH Data AS (SELECT ID, collect_list(Comments) AS cmnts FROM IDsAndComments GROUP BY ID) " +
  "SELECT ID, ArrayToString(cmnts) AS Comments FROM Data")

What happens here is that you define a new function for the sqlContext to use when it parses the SQL code. This function takes a WrappedArray (which is the type of array you get from Spark's DataFrames) and turns it into a string, where every element of the array is separated by a new line.

collect_list is a function that returns an array of the values it grouped. Note that it's a HiveContext function, so your sqlContext needs to be a HiveContext.
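Spark aside, the transformation both answers implement is just "group rows by ID, then join each group's comments with a separator". A minimal sketch of that logic in plain Python (using the sample rows from the question, no Spark required):

```python
from collections import defaultdict

# Sample rows mirroring the (ID, Comments) table above
rows = [
    (32412, "CLOSE AS NORMAL"),
    (32412, "UNDER REVIEW"),
]

# Group by ID, then join each group's comments with a newline,
# which mirrors what collect_list + mkString("\n") does in the Spark answers.
grouped = defaultdict(list)
for row_id, comment in rows:
    grouped[row_id].append(comment)

result = {row_id: "\n".join(comments) for row_id, comments in grouped.items()}
print(result[32412])
```

This prints the two comments for ID 32412 on separate lines, matching the desired output in the question.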

