
How to concat one column records in spark?

I have 2 columns and want to concatenate them; below is a small sample of the data:

ID         Comments

32412     CLOSE AS NORMAL
32412     UNDER REVIEW 

I want this to become as below, so basically the view groups by ID and concatenates the comments:

ID      Comments

32412  CLOSE AS NORMAL
       UNDER REVIEW

An alternative way to do this without using a SQL query:

import scala.collection.mutable
import org.apache.spark.sql.functions.{collect_list, udf}

// Join each grouped array of comments into a single string.
// Use "\n" instead of " " here if you want one comment per line.
val myUDF = udf[String, mutable.WrappedArray[String]](_.mkString(" "))
df.groupBy($"id")
  .agg(collect_list("comments").as("comments"))
  .withColumn("comments", myUDF($"comments"))
  .show()

This also requires the sqlContext to be a HiveContext.

You can use a UDF (user-defined function) for this. Assuming you have a DataFrame named df with the data, you can try something like this:

import scala.collection.mutable

// Register a UDF that joins an array of strings with newlines
sqlContext.udf.register("ArrayToString", (a: mutable.WrappedArray[String]) => a.mkString("\n"))
df.registerTempTable("IDsAndComments")
val new_df = sqlContext.sql(
  "WITH Data AS (SELECT ID, collect_list(Comments) AS cmnts FROM IDsAndComments GROUP BY ID) " +
  "SELECT ID, ArrayToString(cmnts) AS Comments FROM Data")

What happens here is that you define a new function for the sqlContext to use when it parses the SQL code. This function takes a WrappedArray (which is the type of array you get from Spark's DataFrames) and turns it into a string, where every element of the array is separated by a new line.

collect_list is a function that returns an array of the values it grouped. Note that it's a HiveContext function, so your sqlContext needs to be a HiveContext.
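Spark aside, the transformation both answers implement is just "group rows by ID, then join each group's comments with a separator". A minimal sketch of that logic in plain Python (using the sample rows from the question, no Spark required):

```python
from collections import defaultdict

# Sample rows mirroring the (ID, Comments) table above
rows = [
    (32412, "CLOSE AS NORMAL"),
    (32412, "UNDER REVIEW"),
]

# Group by ID, then join each group's comments with a newline,
# which mirrors what collect_list + mkString("\n") does in the Spark answers.
grouped = defaultdict(list)
for row_id, comment in rows:
    grouped[row_id].append(comment)

result = {row_id: "\n".join(comments) for row_id, comments in grouped.items()}
print(result[32412])
```

This prints the two comments for ID 32412 on separate lines, matching the desired output in the question.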

