
How to create an ArrayData or InternalRow given a StructType as the schema in Spark SQL?

When defining a UDT in Spark SQL, I made one like this:

class trajUDT extends UserDefinedType[traj] {
  override def sqlType: DataType = StructType(Seq(
    StructField("id", DataTypes.StringType),
    StructField("loc", ArrayType(StructType(Seq(
      StructField("x", DataTypes.DoubleType),
      StructField("y", DataTypes.DoubleType)
    ))))
  ))
  ...
}

where traj is a class:

class traj(val id: UTF8String, val loc: Array[Tuple2[Double, Double]])

and I want to write a serialize function like this:

override def serialize(p: traj): GenericInternalRow = {
  new GenericInternalRow(Array[Any](p.id, p.loc.map(x => Array(x._1, x._2))))
}

But it failed, telling me that this cannot be converted to an ArrayData.

I also wrote a deserialize function like this:

override def deserialize(datum: Any): traj = {
  val arr = datum.asInstanceOf[InternalRow]
  val id = arr.getUTF8String(0)
  val xytype = StructType(Seq(
    StructField("x", DataTypes.DoubleType),
    StructField("y", DataTypes.DoubleType)
  ))
  val xy = arr.getArray(1)
  val xye = xy.toArray[Tuple2[Double, Double]](xytype)
  new traj(id, xye)
}

And I guess this will not work either...

So can someone show me how to do these two conversions?

I faced a similar problem while working with InternalRow.

Constructing an InternalRow with an Array or Seq leads to a java.lang.ClassCastException.

import org.apache.spark.sql.catalyst.InternalRow

val row = InternalRow(Array(1, 2, 3), 1L)
println(s"Row first element: ${row.getArray(0).toIntArray.toVector}")
println(s"Row second element: ${row.getLong(1)}")
This throws:

java.lang.ClassCastException: [I cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getArray(rows.scala:48)
    at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)

I solved this by passing an ArrayData field instead of an Array or Seq. I used the ArrayData.toArrayData method as follows:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.ArrayData

val row = InternalRow(ArrayData.toArrayData(Array(1, 2, 3)), 1L)
println(s"Row first element: ${row.getArray(0).toIntArray.toVector}")
println(s"Row second element: ${row.getLong(1)}")
This prints:

Row first element: Vector(1, 2, 3)
Row second element: 1
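
The same approach should work for the trajUDT in the question: wrap the mapped array in an ArrayData, and represent each (x, y) pair as a nested InternalRow so that it matches the nested StructType in sqlType. Below is a minimal, untested sketch under those assumptions (the traj and trajUDT definitions come from the question; getStruct/getDouble are how I would expect the nested rows to be read back, but I have not verified this against a specific Spark version):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.catalyst.util.ArrayData

// Inside trajUDT (sketch only):
override def serialize(p: traj): GenericInternalRow = {
  // Each (x, y) pair becomes a nested InternalRow matching the inner StructType;
  // the whole array is wrapped in an ArrayData rather than a plain Scala Array.
  val loc = ArrayData.toArrayData(p.loc.map { case (x, y) => InternalRow(x, y) })
  new GenericInternalRow(Array[Any](p.id, loc))
}

override def deserialize(datum: Any): traj = {
  val row = datum.asInstanceOf[InternalRow]
  val id = row.getUTF8String(0)
  val locData = row.getArray(1)
  // Read each nested struct back out; 2 is the number of fields (x and y).
  val loc = Array.tabulate(locData.numElements()) { i =>
    val pt = locData.getStruct(i, 2)
    (pt.getDouble(0), pt.getDouble(1))
  }
  new traj(id, loc)
}

The key point is the same as above: anything typed as ArrayType in the schema must be an ArrayData at the InternalRow level, and anything typed as a nested StructType must itself be an InternalRow.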
