Spark - java.lang.ClassCastException when passing a column of type Array[Array[Map[String,String]]] into a UDF
I am concatenating two columns of type Array[Map[String,String]] in Spark, resulting in a new column of type Array[Array[Map[String,String]]]. However, I would like to flatten that column so I end up with a single column of type Array[Map[String,String]] containing the values of both original columns.
I have read that from Spark 2.4 it is possible to apply flatten directly on the concatenation of the columns, something like this:
df.withColumn("concatenation", flatten(array($"colArrayMap1", $"colArrayMap2")))
However, I am still on Spark 2.2, so I need to use a UDF for that. This is what I wrote:
def flatten_collection(arr: Array[Array[Map[String,String]]]) = {
  if (arr == null)
    null
  else
    arr.flatten
}
val flatten_collection_udf = udf(flatten_collection _)
df.withColumn("concatenation", array($"colArrayMap1", $"colArrayMap2")).withColumn("concatenation", flatten_collection_udf($"concatenation")).show(false)
But I am getting the following error:
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (array<array<map<string,string>>>) => array<map<string,string>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:835)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:835)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:380)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [[Lscala.collection.immutable.Map;
I assume the cast error is happening in the UDF, but why, and how can I avoid it?
Besides, if someone knows a solution for Spark 2.2 that doesn't require a UDF, even better.
Adapted from the answer here. Seq is needed instead of Array:
def concat_arr(
  arr1: Seq[Map[String,String]],
  arr2: Seq[Map[String,String]]
): Seq[Map[String,String]] = {
  arr1 ++ arr2
}
val concatUDF = udf(concat_arr _)
val df2 = df.withColumn("concatenation", concatUDF($"colArrayMap1", $"colArrayMap2"))
df2.show(false)
+--------------------+--------------------+----------------------------------------+
|colArrayMap1 |colArrayMap2 |concatenation |
+--------------------+--------------------+----------------------------------------+
|[[a -> b], [c -> d]]|[[a -> b], [c -> d]]|[[a -> b], [c -> d], [a -> b], [c -> d]]|
+--------------------+--------------------+----------------------------------------+
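For reference, the original flatten-based UDF also works once the parameter type is changed: Spark deserializes an ArrayType column into scala.collection.mutable.WrappedArray, which implements Seq but cannot be cast to a native Array, hence the ClassCastException. A minimal sketch of the corrected function (the commented lines assume the question's df and the usual Spark imports):

```scala
// Spark hands ArrayType columns to Scala UDFs as WrappedArray (a Seq),
// so the parameter must be declared as Seq, not Array.
def flatten_collection(arr: Seq[Seq[Map[String, String]]]): Seq[Map[String, String]] =
  if (arr == null) null else arr.flatten

// Usage with the question's DataFrame (assumes import org.apache.spark.sql.functions._):
// val flatten_collection_udf = udf(flatten_collection _)
// df.withColumn("concatenation",
//     flatten_collection_udf(array($"colArrayMap1", $"colArrayMap2")))
//   .show(false)
```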