
Spark - java.lang.ClassCastException when passing a column of type Array[Array[Map[String,String]]] into a UDF

I am concatenating two Spark columns of type Array[Map[String,String]], which results in a new column of type Array[Array[Map[String,String]]]. However, I would like to flatten that column so that I end up with a column of type Array[Map[String,String]] containing the values of both original columns.

I have read that from Spark 2.4 it is possible to apply flatten directly to the concatenation of the columns. Something like this:

df.withColumn("concatenation", flatten(array($"colArrayMap1", $"colArrayMap2")))

However, I am still on Spark 2.2, so I need to use a UDF for that. This is what I wrote:

def flatten_collection(arr: Array[Array[Map[String,String]]]) = {
    if(arr == null)
        null
    else
        arr.flatten
}
  
val flatten_collection_udf = udf(flatten_collection _)

df.withColumn("concatenation", array($"colArrayMap1", $"colArrayMap2")).withColumn("concatenation", flatten_collection_udf($"concatenation")).show(false)

But I am getting the following error:

Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (array<array<map<string,string>>>) => array<map<string,string>>)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:835)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:835)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:380)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [[Lscala.collection.immutable.Map;

I assume the cast error is happening in the UDF, but why, and how can I avoid it?

Besides, if someone knows a solution for Spark 2.2 that doesn't require a UDF, even better.

Adapted from the answer here. Seq is needed instead of Array: Spark passes an ArrayType column to a Scala UDF as a scala.collection.mutable.WrappedArray, which is a Seq but not an Array, which is exactly what the ClassCastException above is complaining about.

def concat_arr(
    arr1: Seq[Map[String,String]],
    arr2: Seq[Map[String,String]]
) : Seq[Map[String,String]] =
{
    (arr1 ++ arr2)
}
val concatUDF = udf(concat_arr _)

val df2 = df.withColumn("concatenation", concatUDF($"colArrayMap1", $"colArrayMap2"))

df2.show(false)
+--------------------+--------------------+----------------------------------------+
|colArrayMap1        |colArrayMap2        |concatenation                           |
+--------------------+--------------------+----------------------------------------+
|[[a -> b], [c -> d]]|[[a -> b], [c -> d]]|[[a -> b], [c -> d], [a -> b], [c -> d]]|
+--------------------+--------------------+----------------------------------------+
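If you prefer to keep your original two-step approach (build the nested array with array(...), then flatten it), the same fix should apply: declare the UDF parameter as Seq rather than Array. A minimal sketch, assuming the same df with colArrayMap1 and colArrayMap2:

def flatten_collection(arr: Seq[Seq[Map[String,String]]]): Seq[Map[String,String]] = {
    // Spark hands the ArrayType column to the UDF as a WrappedArray (a Seq),
    // so typing the parameter as Array causes the ClassCastException above.
    if (arr == null)
        null
    else
        arr.flatten
}

val flatten_collection_udf = udf(flatten_collection _)

df.withColumn("concatenation", array($"colArrayMap1", $"colArrayMap2"))
  .withColumn("concatenation", flatten_collection_udf($"concatenation"))
  .show(false)

This should produce the same result as concatUDF above.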
