
Convert DataFrame back to RDD of case class in Spark

I am trying to convert a DataFrame of multiple case classes back to an RDD of those case classes. I can't find any solution. This WrappedArray has driven me crazy :P

For example, assume I have the following:

case class randomClass(a: String, b: Double)
case class randomClass2(a: String, b: Seq[randomClass])
case class randomClass3(a: String, b: String)

val anRDD = sc.parallelize(Seq(
  (randomClass2("a", Seq(randomClass("a1", 1.1), randomClass("a2", 1.1))), randomClass3("aa", "aaa")),
  (randomClass2("b", Seq(randomClass("b1", 1.2), randomClass("b2", 1.2))), randomClass3("bb", "bbb")),
  (randomClass2("c", Seq(randomClass("c1", 3.2), randomClass("c2", 1.2))), randomClass3("cc", "Ccc"))))

val aDF = anRDD.toDF()
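(As an aside: in a compiled application, rather than spark-shell, the toDF call above and the $ / as[...] syntax in the answers below need the session implicits in scope. A minimal setup sketch; the app name and master here are placeholders:)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-roundtrip")  // placeholder app name
  .master("local[*]")        // placeholder master, for local testing
  .getOrCreate()
val sc = spark.sparkContext
import spark.implicits._     // enables .toDF(), $"...", and .as[T]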

Given aDF, how can I get anRDD back?

I tried something like this just to get the second column, but it gives an error:

aDF.map { case r:Row => r.getAs[randomClass3]("_2")}

You can convert indirectly using Dataset[randomClass3]:

aDF.select($"_2.*").as[randomClass3].rdd
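The select($"_2.*") step flattens the struct's fields into top-level columns a and b, which is what lets the randomClass3 encoder match them by name. A quick sanity check (output is what the sample data above should produce):

aDF.select($"_2.*").as[randomClass3].rdd.collect().foreach(println)
// randomClass3(aa,aaa)
// randomClass3(bb,bbb)
// randomClass3(cc,Ccc)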

Spark DataFrame / Dataset[Row] represents data as Row objects, using the mapping described in the Spark SQL, DataFrames and Datasets Guide. Any call to getAs should follow this mapping.
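Concretely, with the schema above a struct column arrives as a Row and the nested array<struct> as a Seq[Row]. A small sketch that stays inside the documented mapping:

import org.apache.spark.sql.Row

val innerNames = aDF.rdd.map { r =>
  val rc2 = r.getAs[Row]("_1")           // struct -> Row
  val nested = rc2.getAs[Seq[Row]]("b")  // array<struct> -> Seq[Row]
  nested.map(_.getAs[String]("a"))       // string -> String
}
innerNames.collect().foreach(println)    // the "a" fields of each nested randomClass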

For the second column, which is struct<a: string, b: string>, it would be a Row as well:

aDF.rdd.map { _.getAs[Row]("_2") }
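If you need real randomClass3 instances here rather than rows, one option (just a sketch; the Dataset route above is simpler) is to rebuild them field by field:

import org.apache.spark.sql.Row

val rc3RDD = aDF.rdd.map { r =>
  val inner = r.getAs[Row]("_2")
  randomClass3(inner.getAs[String]("a"), inner.getAs[String]("b"))
}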

As commented by Tzach Zohar, to get back a full RDD you'll need:

aDF.as[(randomClass2, randomClass3)].rdd 
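This decodes both halves of each tuple, so the result has the same element type as the original anRDD. A quick roundtrip check, assuming the definitions above:

import org.apache.spark.rdd.RDD

val recovered: RDD[(randomClass2, randomClass3)] =
  aDF.as[(randomClass2, randomClass3)].rdd

recovered.collect().foreach(println)  // same three tuples as in anRDD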

I don't know the Scala API, but have you considered the rdd value?

Maybe something like:

aDF.rdd.map { case r: Row => r.getAs[Row]("_2") }
