
Create a Sequence of case class objects from a Spark DataFrame

How do I iterate through Spark DataFrame rows and add them to a Sequence of case class objects?

DF1:

val someDF = Seq(
  ("202003101750", "202003101700", 122),
  ("202003101800", "202003101700", 12),
  ("202003101750", "202003101700", 42)
).toDF("number", "word", "value")

Case Class:

case class ValuePerNumber(num: String, wrd: String, defaultID: Int, size: Long = 0)

Expected Output:

Seq(
  ValuePerNumber("202003101750", "202003101700", 0, 122),
  ValuePerNumber("202003101800", "202003101700", 0, 12),
  ValuePerNumber("202003101750", "202003101700", 0, 42)
)

In each case the defaultID can be 0. I am not sure how to approach this problem and would appreciate any solution or suggestion!

I have tried the following:

val x = someDF.as[ValuePerNumber].collect()

I get the following error:

org.apache.spark.sql.AnalysisException: cannot resolve '`num`' given input columns: [number, word, value];


import org.apache.spark.sql.functions.lit
import spark.implicits._ // assumes a SparkSession named spark is in scope

val someDF = Seq(
  ("202003101750", "202003101700", 122),
  ("202003101800", "202003101700", 12),
  ("202003101750", "202003101700", 42)
).toDF("number", "word", "value")

case class ValuePerNumber(number: String, word: String, defaultID: Int, value: Long)

someDF.withColumn("defaultID", lit(0)).as[ValuePerNumber].collect().toSeq

The column count and the column names of the DataFrame and the case class must match in order to use as[ValuePerNumber] on the DataFrame directly, without extracting values row by row. Two changes are needed:

  1. size is not available in the DataFrame, so it is added with withColumn (see the REPL session below).
  2. The column names in the DataFrame and the field names in the case class did not match, so they are modified to agree.
scala> :paste
// Entering paste mode (ctrl-D to finish)

val someDF = Seq(("202003101750", "202003101700",122),("202003101800", "202003101700",12),("202003101750", "202003101700",42))
.toDF("number", "word","value")
.withColumn("size",lit(0)) // Added this to match your case class columns


// Exiting paste mode, now interpreting.

someDF: org.apache.spark.sql.DataFrame = [number: string, word: string ... 2 more fields]

scala> case class ValuePerNumber(number:String, word:String, value:Int, size: Long=0) // Modified column names to match your dataframe column names.
defined class ValuePerNumber

scala> someDF.as[ValuePerNumber].show(false)
+------------+------------+-----+----+
|number      |word        |value|size|
+------------+------------+-----+----+
|202003101750|202003101700|122  |0   |
|202003101800|202003101700|12   |0   |
|202003101750|202003101700|42   |0   |
+------------+------------+-----+----+


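Alternatively, you can keep the question's original case class (num / wrd / defaultID / size) and align the DataFrame to it with a single select. A minimal sketch, assuming a SparkSession named spark is in scope:

import org.apache.spark.sql.functions.lit
import spark.implicits._ // assumed SparkSession named spark

case class ValuePerNumber(num: String, wrd: String, defaultID: Int, size: Long = 0)

val result: Seq[ValuePerNumber] = someDF.select(
  $"number".as("num"),             // rename columns to the case class field names
  $"word".as("wrd"),
  lit(0).as("defaultID"),          // defaultID is not in the DataFrame, so add it as a constant
  $"value".cast("long").as("size") // value becomes size, widened to Long
).as[ValuePerNumber].collect().toSeq

This produces exactly the expected output, with defaultID fixed at 0 and the original value column in size.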

You can create a Dataset[ValuePerNumber] and collect it as a Seq:

val someDF = Seq(
  ("202003101750", "202003101700", 122),
  ("202003101800", "202003101700", 12),
  ("202003101750", "202003101700", 42)
).toDF("number", "word", "value")

val result = someDF.map(r => ValuePerNumber(r.getAs[String]("number"), r.getAs[String]("word"), 0, r.getAs[Int]("value"))).collect().toSeq
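Here defaultID is passed explicitly as 0 and the value column lands in size, matching the expected output; using getAs with column names rather than positional indexes also keeps the mapping stable if the column order ever changes.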

You can also add the missing column to the DataFrame and rename its columns to match the case class fields, after which you can directly do:

val x = someDF.as[ValuePerNumber].collect()
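A minimal sketch of that approach, using the question's original case class and again assuming a SparkSession named spark is in scope:

import org.apache.spark.sql.functions.lit
import spark.implicits._ // assumed SparkSession named spark

val x = someDF
  .withColumnRenamed("number", "num") // rename DataFrame columns to the case class field names
  .withColumnRenamed("word", "wrd")
  .withColumnRenamed("value", "size")
  .withColumn("defaultID", lit(0))    // add the column missing from the DataFrame
  .as[ValuePerNumber]
  .collect()
  .toSeq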
