Merge two Spark SQL columns of type Array[String] into a new Array[String] column
I have two columns in a Spark SQL DataFrame, with each entry in either column being an array of strings.
val ngramDataFrame = Seq(
  (Seq("curious", "bought", "20"), Seq("iwa", "was", "asj"))
).toDF("filtered_words", "ngrams_array")
I want to merge the arrays in each row to make a single array in a new column. My code is as follows:
def concat_array(firstarray: Array[String],
                 secondarray: Array[String]): Array[String] = {
  (firstarray ++ secondarray).toArray
}
val concatUDF = udf(concat_array _)
val concatFrame = ngramDataFrame.withColumn("full_array",
  concatUDF($"filtered_words", $"ngrams_array"))
I can successfully use the concat_array function on two arrays. However, when I run the above code, I get the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 1 times, most recent failure: Lost task 0.0 in stage 16.0 (TID 12, localhost): org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (array, array) => array)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
  at $line80.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:76)
  ... 13 more
Driver stacktrace:
In Spark 2.4 or later you can use concat (if you want to keep duplicates):
ngramDataFrame.withColumn(
  "full_array", concat($"filtered_words", $"ngrams_array")
).show
+--------------------+---------------+--------------------+
| filtered_words| ngrams_array| full_array|
+--------------------+---------------+--------------------+
|[curious, bought,...|[iwa, was, asj]|[curious, bought,...|
+--------------------+---------------+--------------------+
or array_union (if you want to drop duplicates):
ngramDataFrame.withColumn(
  "full_array",
  array_union($"filtered_words", $"ngrams_array")
)
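For the sample row above, where the two arrays share no elements, array_union returns the same six elements as concat. A sketch of the expected output, using show(false) so the cells are not truncated:

ngramDataFrame.withColumn(
  "full_array",
  array_union($"filtered_words", $"ngrams_array")
).show(false)

+---------------------+---------------+------------------------------------+
|filtered_words       |ngrams_array   |full_array                          |
+---------------------+---------------+------------------------------------+
|[curious, bought, 20]|[iwa, was, asj]|[curious, bought, 20, iwa, was, asj]|
+---------------------+---------------+------------------------------------+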
These can also be composed from the other higher order functions, for example
ngramDataFrame.withColumn(
  "full_array",
  flatten(array($"filtered_words", $"ngrams_array"))
)
with duplicates, and
ngramDataFrame.withColumn(
  "full_array",
  array_distinct(flatten(array($"filtered_words", $"ngrams_array")))
)
without.
On a side note, you shouldn't use WrappedArray when working with ArrayType columns. Instead you should expect the guaranteed interface, which is Seq. So the udf should use a function with the following signature:
(Seq[String], Seq[String]) => Seq[String]
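For example, a minimal corrected version of the udf from the question (a sketch; same logic as the original, just typed against Seq):

def concat_array(firstarray: Seq[String],
                 secondarray: Seq[String]): Seq[String] =
  firstarray ++ secondarray  // Seq ++ Seq already yields a Seq; no cast to Array needed

val concatUDF = udf(concat_array _)
val concatFrame = ngramDataFrame.withColumn(
  "full_array", concatUDF($"filtered_words", $"ngrams_array"))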
Please refer to the SQL Programming Guide for details.
Arjun, there is an error in the udf you created. When you pass array-type columns, the data type is not Array[String] but WrappedArray[String]. Below I am pasting the modified udf along with its output.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
import scala.collection.mutable

// Minimal local config for the example
val sparkConf = new SparkConf().setAppName("ConcatArrays").setMaster("local[*]")
val SparkCtxt = new SparkContext(sparkConf)
val sqlContext = new SQLContext(SparkCtxt)
import sqlContext.implicits._

// Two ArrayType(StringType) columns built from an explicit schema
val temp = SparkCtxt.parallelize(Seq(Row(Array("String1", "String2"), Array("String3", "String4"))))
val df = sqlContext.createDataFrame(temp,
  StructType(List(
    StructField("Col1", ArrayType(StringType), true),
    StructField("Col2", ArrayType(StringType), true)
  )))

// Spark hands ArrayType columns to a udf as WrappedArray, not Array
def concat_array(firstarray: mutable.WrappedArray[String],
                 secondarray: mutable.WrappedArray[String]): mutable.WrappedArray[String] = {
  firstarray ++ secondarray
}

val concatUDF = udf(concat_array _)
val df2 = df.withColumn("udftest", concatUDF(df.col("Col1"), df.col("Col2")))

df2.select("udftest").foreach { each =>
  println("***********")
  println(each(0))
}
df2.show(true)
OUTPUT:
+------------------+------------------+--------------------+
| Col1| Col2| udftest|
+------------------+------------------+--------------------+
|[String1, String2]|[String3, String4]|[String1, String2...|
+------------------+------------------+--------------------+
WrappedArray(String1, String2, String3, String4)