I have 2 columns and want to concatenate the comments for each ID. Below is a small sample of the data:
ID Comments
32412 CLOSE AS NORMAL
32412 UNDER REVIEW
I want the result to look like the output below. Basically, group by ID and concatenate the comments.
ID Comments
32412 CLOSE AS NORMAL
UNDER REVIEW
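An alternative way to do this without using a SQL query:
import scala.collection.mutable
import org.apache.spark.sql.functions.{collect_list, udf}
// Join the collected comments into a single space-separated string.
val myUDF = udf[String, mutable.WrappedArray[String]](_.mkString(" "))
df.groupBy($"id")
  .agg(collect_list("comments").as("comments")) // gather each ID's comments into an array
  .withColumn("comments", myUDF($"comments"))   // replace the array with the joined string
  .show()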
This approach also requires your SQLContext to be a HiveContext.
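If you are on a newer Spark release, the UDF may not be needed at all. The sketch below assumes Spark 2.x or later (where SparkSession is the entry point) and uses the built-in concat_ws function, which accepts an array column, to join the collected comments. The object name, the helper concatComments, and the local master setting are hypothetical and only there to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

object ConcatWsSketch {
  // Hypothetical helper: returns one row per id, with that id's comments joined by `sep`.
  // Assumes `df` has columns named "id" and "comments", as in the question.
  def concatComments(df: DataFrame, sep: String): DataFrame =
    df.groupBy(col("id"))
      .agg(concat_ws(sep, collect_list(col("comments"))).as("comments"))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (an assumption, not part of the original answer).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("concat-ws-sketch")
      .getOrCreate()
    import spark.implicits._

    // The sample rows from the question.
    val df = Seq(
      (32412, "CLOSE AS NORMAL"),
      (32412, "UNDER REVIEW")
    ).toDF("id", "comments")

    concatComments(df, " ").show(false)
    spark.stop()
  }
}
```

Pass "\n" instead of " " as the separator if you want each comment on its own line within the cell.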
You can use a UDF (user-defined function) for this. Assuming you have a DataFrame named df with the data, you can try something like this:
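import scala.collection.mutable
// Register a UDF that joins an array of strings with newlines.
sqlContext.udf.register("ArrayToString", (a: mutable.WrappedArray[String]) => a.mkString("\n"))
// Expose the DataFrame to SQL as a temporary table.
df.registerTempTable("IDsAndComments")
// Collect each ID's comments into an array, then join them with the UDF.
val new_df = sqlContext.sql("WITH Data AS (SELECT ID, collect_list(Comments) AS cmnts FROM IDsAndComments GROUP BY ID) SELECT ID, ArrayToString(cmnts) AS Comments FROM Data")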
What happens here is that you define a new function for the sqlContext to use when it parses the SQL code. This function takes a WrappedArray (which is the type of array you get from Spark's DataFrames) and turns it into a string, where every element of the array is separated by a newline.
collect_list is a function that returns an array of the values it grouped. Note that it's a HiveContext function, so you need your sqlContext to be a HiveContext.
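To see what the two steps do without starting Spark, here is a rough sketch using plain Scala collections: groupBy plays the role of GROUP BY with collect_list, and mkString plays the role of the ArrayToString UDF. The object name GroupConcatSketch and the helper concatById are hypothetical, and this only mirrors the logic; it is not how Spark executes the query.

```scala
object GroupConcatSketch {
  // Hypothetical helper: groups (id, comment) pairs by id and joins each
  // group's comments with `sep`, like collect_list followed by mkString.
  def concatById(rows: Seq[(Int, String)], sep: String): Map[Int, String] =
    rows
      .groupBy { case (id, _) => id }
      .map { case (id, group) => id -> group.map(_._2).mkString(sep) }

  def main(args: Array[String]): Unit = {
    // The sample rows from the question.
    val rows = Seq(
      (32412, "CLOSE AS NORMAL"),
      (32412, "UNDER REVIEW")
    )

    // Print one entry per ID, with its comments separated by newlines.
    concatById(rows, "\n").foreach { case (id, comments) =>
      println(s"$id $comments")
    }
  }
}
```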