
How to concatenate records in one column in Spark?

I have two columns and want to group by one and concatenate the values of the other. Below is a small sample of the data:

ID         Comments

32412     CLOSE AS NORMAL
32412     UNDER REVIEW 

I want the result to look like the following, i.e. group by ID and concatenate the comments:

ID      Comments

32412  CLOSE AS NORMAL
       UNDER REVIEW

An alternative way to do this without a SQL query:

import scala.collection.mutable
import org.apache.spark.sql.functions.{collect_list, udf}
import sqlContext.implicits._  // for the $"..." column syntax

// UDF that joins the collected array of comments with a space separator
val myUDF = udf[String, mutable.WrappedArray[String]](_.mkString(" "))

df.groupBy($"id")
  .agg(collect_list("comments").as("comments"))
  .withColumn("comments", myUDF($"comments"))
  .show()

This also requires your sqlContext to be a HiveContext, since collect_list is a Hive function.

You can use a UDF (user-defined function) for this. Assuming you have a DataFrame named df holding the data, you can try something like this:

import scala.collection.mutable

// Register a UDF that joins an array of strings with newlines
sqlContext.udf.register("ArrayToString", (a: mutable.WrappedArray[String]) => a.mkString("\n"))

df.registerTempTable("IDsAndComments")

val new_df = sqlContext.sql("""
  WITH Data AS (
    SELECT ID, collect_list(Comments) AS cmnts
    FROM IDsAndComments
    GROUP BY ID
  )
  SELECT ID, ArrayToString(cmnts) AS Comments
  FROM Data
""")

What happens here is that you define a new function for the sqlContext to use when it parses SQL code. This function takes a WrappedArray (the array type you get back from Spark's DataFrames) and turns it into a single string, with every element of the array separated by a newline.

collect_list is a function that returns an array of the values in each group. Note that it's a HiveContext function, so your sqlContext needs to be a HiveContext.
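As a side note, on newer Spark versions the built-in concat_ws function can join the array produced by collect_list directly, so no custom UDF registration is needed. A minimal sketch, assuming a DataFrame df with ID and Comments columns as above:

import org.apache.spark.sql.functions.{collect_list, concat_ws}
import sqlContext.implicits._  // for the $"..." column syntax

// concat_ws joins the collected array with the given separator,
// replacing the ArrayToString UDF from the SQL approach
val result = df.groupBy($"ID")
  .agg(concat_ws("\n", collect_list($"Comments")).as("Comments"))

result.show(false)

Keeping the whole aggregation in built-in functions lets Spark's optimizer handle it without the serialization overhead of a UDF.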
