
Scala UDF returning 'Schema for type Unit is not supported'

I want to make changes to a column in the dataframe. The column is an array of integers. I want to replace elements of the array, taking an index from a second array and replacing the element at that position with the corresponding element from a third array. Example: I have three columns C1, C2, C3, all arrays. I want to replace elements in C3 as follows:

C3[C2[i]] = C1[i].
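In plain Scala terms, the intended transformation looks like this (a minimal sketch with hypothetical values, reading the C2 indices as 0-based exactly as the formula is written; the Day column in my actual data holds 1-based positions):

val c1 = Seq(101, 102)   // source values
val c2 = Seq(0, 1)       // target positions in c3
val c3 = Seq(11, 12)     // sequence to rewrite
val buf = c3.toBuffer    // copy into a mutable buffer
for (i <- c2.indices) buf(c2(i)) = c1(i)  // c3(c2(i)) = c1(i)
val result = buf.toSeq   // Seq(101, 102)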

I wrote the following UDF:

def UpdateHist2 = udf((CRF_count: Seq[Long], Day: Seq[String], History: Seq[Int]) =>
  for (i <- 0 to Day.length - 1) {
    History.updated(Day(i).toInt - 1, CRF_count(i).toInt)
  })

and executed this:

histdate3.withColumn("History2", UpdateHist2(col("CRF_count"), col("Day"), col("History"))).show()

But it's returning an error as below:

scala> histdate3.withColumn("History2", UpdateHist2(col("CRF_count"), col("Day"), col("History"))).show()

java.lang.UnsupportedOperationException: Schema for type Unit is not supported
  at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:733)
  at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:671)
  at org.apache.spark.sql.functions$.udf(functions.scala:3100)
  at UpdateHist2(<console>:25)
  ... 48 elided

I think I'm returning a different type, a View type, which is not supported. Please help me figure out how I can solve this.

Your for loop returns a Unit, hence the error message. You could use for-yield to return values, but since the Seq should be updated successively, a simple foldLeft works better (a short sketch of the Unit issue first, then the full example).
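Here is a minimal illustration (an assumed spark-shell snippet, not from your post) of why the UDF body has type Unit: a for comprehension without yield discards every History.updated(...) result it computes, and Spark cannot derive a return schema for Unit:

val xs = Seq(11, 12)
val dropped = for (i <- 0 to 1) { xs.updated(i, 0) }   // dropped: Unit -- results thrown away
val kept    = for (i <- 0 to 1) yield xs.updated(i, 0) // kept: IndexedSeq[Seq[Int]]

And the full foldLeft solution: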

import org.apache.spark.sql.functions._
import spark.implicits._  // for .toDF and $"..." (already in scope inside spark-shell)

val df = Seq(
  (Seq(101L, 102L), Seq("1", "2"), Seq(11, 12)),
  (Seq(201L, 202L, 203L), Seq("2", "3"), Seq(21, 22, 23))
).toDF("C1", "C2", "C3")
// +---------------+------+------------+
// |C1             |C2    |C3          |
// +---------------+------+------------+
// |[101, 102]     |[1, 2]|[11, 12]    |
// |[201, 202, 203]|[2, 3]|[21, 22, 23]|
// +---------------+------+------------+

def updateC3 = udf( (c1: Seq[Long], c2: Seq[String], c3: Seq[Int]) =>
  // Fold over the positions of c2, threading c3 through as the accumulator:
  // at step i, set the element at position c2(i) - 1 to c1(i)
  c2.zipWithIndex.foldLeft( c3 ){ case (acc, (day, i)) =>
    val idx = day.toInt - 1
    acc.updated(idx, c1(i).toInt)
  }
)

df.withColumn("C3", updateC3($"C1", $"C2", $"C3")).show(false)
// +---------------+------+--------------+
// |C1             |C2    |C3            |
// +---------------+------+--------------+
// |[101, 102]     |[1, 2]|[101, 102]    |
// |[201, 202, 203]|[2, 3]|[21, 201, 202]|
// +---------------+------+--------------+
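One caveat: Seq.updated throws an IndexOutOfBoundsException when the position falls outside the sequence, so if a Day value can ever exceed the length of History, a bounds check keeps one bad row from failing the whole job. A defensive variant of the same fold (a sketch, with the hypothetical name updateC3Safe):

def updateC3Safe = udf( (c1: Seq[Long], c2: Seq[String], c3: Seq[Int]) =>
  c2.zipWithIndex.foldLeft( c3 ){ case (acc, (day, i)) =>
    val idx = day.toInt - 1
    // Skip entries whose 1-based Day value points outside the accumulator
    if (idx >= 0 && idx < acc.length) acc.updated(idx, c1(i).toInt) else acc
  }
)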
