
Merge two spark sql columns of type Array[string] into a new Array[string] column

I have two columns in a Spark SQL DataFrame with each entry in either column as an array of strings.

val  ngramDataFrame = Seq(
  (Seq("curious", "bought", "20"), Seq("iwa", "was", "asj"))
).toDF("filtered_words", "ngrams_array")

I want to merge the arrays in each row to make a single array in a new column. My code is as follows:

def concat_array(firstarray: Array[String],
                 secondarray: Array[String]): Array[String] =
  (firstarray ++ secondarray).toArray
val concatUDF = udf(concat_array _)
val concatFrame = ngramDataFrame.withColumn("full_array", concatUDF($"filtered_words", $"ngrams_array"))

I can successfully use the concat_array function on two arrays. However when I run the above code, I get the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 1 times, most recent failure: Lost task 0.0 in stage 16.0 (TID 12, localhost): org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (array, array) => array)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
    at $line80.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:76)
    ... 13 more
Driver stacktrace:

In Spark 2.4 or later you can use concat (if you want to keep duplicates):

ngramDataFrame.withColumn(
  "full_array", concat($"filtered_words", $"ngrams_array")
).show
+--------------------+---------------+--------------------+
|      filtered_words|   ngrams_array|          full_array|
+--------------------+---------------+--------------------+
|[curious, bought,...|[iwa, was, asj]|[curious, bought,...|
+--------------------+---------------+--------------------+

or array_union (if you want to drop duplicates):

ngramDataFrame.withColumn(
  "full_array",
   array_union($"filtered_words", $"ngrams_array")
)

These can also be composed from other higher-order functions, for example

ngramDataFrame.withColumn(
   "full_array",
   flatten(array($"filtered_words", $"ngrams_array"))
)

with duplicates, and

ngramDataFrame.withColumn(
   "full_array",
   array_distinct(flatten(array($"filtered_words", $"ngrams_array")))
)

without.
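For reference, the snippets above rely on the standard Spark SQL functions and the implicit $ column syntax. A minimal sketch of the imports they assume (the SparkSession value named spark is an assumption, not shown in the answer):

import org.apache.spark.sql.functions.{array, array_distinct, array_union, concat, flatten}
import spark.implicits._  // enables the $"column" syntax; assumes a SparkSession named `spark`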

On a side note, you shouldn't use WrappedArray when working with ArrayType columns. Instead you should expect the guaranteed interface, which is Seq. So the udf should use a function with the following signature:

(Seq[String], Seq[String]) => Seq[String]
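For example, a minimal sketch of a udf under that signature (the name concatSeqUDF is illustrative and not from the original answer):

import org.apache.spark.sql.functions.udf

// Accept the Seq interface that Spark guarantees for ArrayType columns,
// instead of Array, which causes the ClassCastException shown above.
val concatSeqUDF = udf { (first: Seq[String], second: Seq[String]) =>
  first ++ second
}

ngramDataFrame.withColumn(
  "full_array",
  concatSeqUDF($"filtered_words", $"ngrams_array")
)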

Please refer to the SQL Programming Guide for details.

Arjun, there is an error in the udf you created. When you pass array-type columns, the data type is not Array[String]; it is WrappedArray[String]. Below I am pasting the modified udf along with the output.

import scala.collection.mutable

import org.apache.spark.SparkContext
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
import org.apache.spark.sql.functions._

// assumes a SparkConf named `sparkConf` is defined elsewhere
val SparkCtxt = new SparkContext(sparkConf)
val sqlContext = new SQLContext(SparkCtxt)

import sqlContext.implicits._

val temp = SparkCtxt.parallelize(Seq(Row(Array("String1", "String2"), Array("String3", "String4"))))
val df = sqlContext.createDataFrame(temp,
  StructType(List(
    StructField("Col1", ArrayType(StringType), true),
    StructField("Col2", ArrayType(StringType), true)
  ))
)

// The UDF parameters must be WrappedArray, the runtime type of ArrayType columns
def concat_array(firstarray: mutable.WrappedArray[String],
                 secondarray: mutable.WrappedArray[String]): mutable.WrappedArray[String] =
  firstarray ++ secondarray

val concatUDF = udf(concat_array _)
val df2 = df.withColumn("udftest", concatUDF(df.col("Col1"), df.col("Col2")))

df2.select("udftest").foreach { each =>
  println("***********")
  println(each(0))
}
df2.show(true)

OUTPUT:

+------------------+------------------+--------------------+
|              Col1|              Col2|             udftest|
+------------------+------------------+--------------------+
|[String1, String2]|[String3, String4]|[String1, String2...|
+------------------+------------------+--------------------+

WrappedArray(String1, String2, String3, String4)
