How to concatenate records of one column in Spark?
I have 2 columns and want to concatenate both; below is a small set of data:
ID Comments
32412 CLOSE AS NORMAL
32412 UNDER REVIEW
I want the result to look like the below; basically, group by ID and concatenate the comments:
ID Comments
32412 CLOSE AS NORMAL
UNDER REVIEW
An alternate way to do this without using a SQL query:
import scala.collection.mutable
val myUDF = udf[String, mutable.WrappedArray[String]](_.mkString(" "))
df.groupBy($"id")
.agg(collect_list("comments").as("comments"))
.withColumn("comments", myUDF($"comments"))
.show()
It requires a HiveContext as the SQLContext as well.
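To see what the pipeline above computes, here is a minimal sketch using plain Scala collections, with sample rows taken from the question's data: `groupBy` plays the role of `collect_list`, and `mkString(" ")` plays the role of the UDF. Note that on a real DataFrame, `collect_list` makes no ordering guarantee within a group, whereas this sketch preserves input order:

```scala
// Sample rows mirroring the question's data: (id, comment).
val rows = Seq((32412, "CLOSE AS NORMAL"), (32412, "UNDER REVIEW"))

// Group the comments per id (what collect_list does in the aggregation),
// then join each group with a space (what the UDF does).
val grouped: Map[Int, String] = rows
  .groupBy(_._1)
  .map { case (id, pairs) => id -> pairs.map(_._2).mkString(" ") }

println(grouped(32412)) // CLOSE AS NORMAL UNDER REVIEW
```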
You can use a UDF (user defined function) for this. Assuming you have a DataFrame named df with the data, you can try something like this:
import scala.collection.mutable
sqlContext.udf.register("ArrayToString",(a: mutable.WrappedArray[String]) => a.mkString("\n"))
df.registerTempTable("IDsAndComments")
val new_df = sqlContext.sql("WITH Data AS (SELECT ID, collect_list(Comments) AS cmnts FROM IDsAndComments GROUP BY ID) SELECT ID, ArrayToString(cmnts) AS Comments FROM Data")
What happens here is that you define a new function for the sqlContext to use when it parses the SQL code. This function takes a WrappedArray (which is the type of array you get from Spark's DataFrames) and turns it into a string, where every element of the array is separated by a new line.
collect_list is a function that returns an array of the values it grouped. Note that it's a HiveContext function, so you need your sqlContext to be a HiveContext.
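Stripped of the SQL machinery, the registered ArrayToString UDF is just a newline join. A minimal sketch with plain Scala (the sample values are taken from the question's data):

```scala
// The body of the ArrayToString UDF: join array elements with newlines.
val arrayToString = (a: Seq[String]) => a.mkString("\n")

val comments = arrayToString(Seq("CLOSE AS NORMAL", "UNDER REVIEW"))
println(comments)
// CLOSE AS NORMAL
// UNDER REVIEW
```

On newer Spark versions (2.x and later), the UDF can typically be avoided altogether by combining the built-in concat_ws and collect_list functions directly in the aggregation.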