
Scala compare dataframe complex array type field

I'm trying to create a dataframe to feed to a function as part of my unit tests. If I have the following

val myDf = sparkSession.sqlContext.createDataFrame(
  sparkSession.sparkContext.parallelize(Seq(
    Row(Some(Seq(MyObject(1024, 100001D), MyObject(1, -1D)))))),
    StructType(List(
      StructField("myList", ArrayType[???], true)
    )))

MyObject is a case class.

I don't know what to put for the array's element type. Any suggestions? I've tried ArrayType with pretty much every combination I can think of.

I'm looking for a dataframe that looks something like:

+--------------------+
|   myList           |
+--------------------+
| [1024, 100001]     |
| [1, -1]            |
+--------------------+
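
For reference, ArrayType takes a DataType for its element type, so an array of a case class is expressed as ArrayType(StructType(...)) with one StructField per case class field. Below is a minimal sketch of the explicit-schema route, assuming MyObject has an Int field and a Double field (prop1/prop2 are placeholder names) and sparkSession is the session from the snippet above; the nested elements are passed as Rows to match the schema, and the Option wrapper is not needed (use null for a missing list):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// struct describing one MyObject (placeholder field names prop1/prop2)
val myObjectType = StructType(List(
  StructField("prop1", IntegerType, nullable = false),
  StructField("prop2", DoubleType, nullable = false)
))

val myDf = sparkSession.createDataFrame(
  sparkSession.sparkContext.parallelize(Seq(
    // one row whose single column is a list of two MyObject-shaped Rows
    Row(Seq(Row(1024, 100001D), Row(1, -1D)))
  )),
  StructType(List(
    StructField("myList", ArrayType(myObjectType, containsNull = false), nullable = true)
  ))
)

myDf.printSchema   // myList: array whose element is struct<prop1:int,prop2:double>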

Coming at it from the other direction, you can let Spark infer the schema for you:

// assumes `import spark.implicits._` is in scope (it is by default in spark-shell)
val s = Seq(Array(1024, 100001D), Array(1, -1D)).toDS().toDF("myList")
println(s.schema)
s.printSchema
s.show

The inferred schema looks like this; DoubleType shows up because 100001D and -1D are doubles.

StructType(StructField(myList,ArrayType(DoubleType,false),true))

The output you were after:

root
 |-- myList: array (nullable = true)
 |    |-- element: double (containsNull = false)

+------------------+
|            myList|
+------------------+
|[1024.0, 100001.0]|
|       [1.0, -1.0]|
+------------------+
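
The same inference trick also works with the case class itself rather than a plain Array of doubles, which reveals the array-of-struct schema Spark expects for a list of MyObject. A sketch, assuming spark is the SparkSession and a Spark version whose implicits provide encoders for Seq of case classes:

import spark.implicits._

case class MyObject(a: Int, b: Double)

// a Seq of Seq[MyObject] becomes a single array-of-struct column
val s2 = Seq(Seq(MyObject(1024, 100001D), MyObject(1, -1D))).toDS().toDF("myList")

s2.printSchema   // myList: array, element: struct with a: integer and b: double
s2.show(false)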

Alternatively, you can do it this way:

case class MyObject(a: Int, b: Double)

// assumes `import spark.implicits._` and `import org.apache.spark.sql.functions.struct`
val s = Seq(MyObject(1024, 100001D), MyObject(1, -1D)).toDS()
  .select(struct($"a", $"b").as[MyObject] as "myList")
println(s.schema)
s.printSchema
s.show

Result:

// schema:
StructType(StructField(myList,StructType(StructField(a,IntegerType,false), StructField(b,DoubleType,false)),false))

root
 |-- myList: struct (nullable = false)
 |    |-- a: integer (nullable = false)
 |    |-- b: double (nullable = false)

+----------------+
|          myList|
+----------------+
|[1024, 100001.0]|
|       [1, -1.0]|
+----------------+
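
As a small usage note on the struct column: its fields can be pulled back out with dot notation, e.g. with the Dataset s from the snippet above:

// extract the struct's fields as ordinary columns
s.select($"myList.a", $"myList.b").show()
// +----+--------+
// |   a|       b|
// +----+--------+
// |1024|100001.0|
// |   1|    -1.0|
// +----+--------+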

Try this:

scala> case class MyObject(prop1:Int, prop2:Double)
defined class MyObject

scala> val df = Seq((1024, 100001D), (1, -1D)).toDF("prop1","prop2").select(struct($"prop1",$"prop2").as[MyObject] as "myList")
df: org.apache.spark.sql.DataFrame = [myList: struct<prop1: int, prop2: double>]

scala> df.show(false)
+----------------+
|myList          |
+----------------+
|[1024, 100001.0]|
|[1, -1.0]       |
+----------------+


scala> df.printSchema
root
 |-- myList: struct (nullable = false)
 |    |-- prop1: integer (nullable = false)
 |    |-- prop2: double (nullable = false)
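
To tie this back to the original createDataFrame attempt: instead of spelling out the StructType for MyObject by hand, its struct schema can be derived from the case class with an encoder and dropped into ArrayType. A sketch, reusing the MyObject definition from this answer:

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.{ArrayType, StructField, StructType}

// derive struct<prop1:int,prop2:double> from the case class instead of writing it out
val myObjectSchema = Encoders.product[MyObject].schema

val schema = StructType(List(
  StructField("myList", ArrayType(myObjectSchema, containsNull = true), nullable = true)
))
// `schema` can now be passed to createDataFrame together with Rows, as in the sketch near the top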

