
Create a Sequence of case class objects from a Spark DataFrame

How do I iterate through Spark DataFrame rows and add them to a Sequence of case class objects?

DF1:

val someDF = Seq(
  ("202003101750", "202003101700", 122),
  ("202003101800", "202003101700", 12),
  ("202003101750", "202003101700", 42)
).toDF("number", "word", "value")

Case Class:

case class ValuePerNumber(num: String, wrd: String, defaultID: Int, size: Long = 0)

Expected Output:

Seq(
  ValuePerNumber("202003101750", "202003101700", 0, 122),
  ValuePerNumber("202003101800", "202003101700", 0, 12),
  ValuePerNumber("202003101750", "202003101700", 0, 42)
)

In each case the defaultID can be 0. I am not sure how to approach this problem and would appreciate any solution or suggestion!

I have tried the following:

val x = someDF.as[ValuePerNumber].collect()

I get the following error:

org.apache.spark.sql.AnalysisException: cannot resolve '`num`' given input columns: [number, word, value];


import org.apache.spark.sql.functions.lit
import spark.implicits._ // assumes a SparkSession named spark is in scope

val someDF = Seq(
  ("202003101750", "202003101700", 122),
  ("202003101800", "202003101700", 12),
  ("202003101750", "202003101700", 42)
).toDF("number", "word", "value")

case class ValuePerNumber(number: String, word: String, defaultID: Int, value: Long)

someDF.withColumn("defaultID", lit(0)).as[ValuePerNumber].collect().toSeq

The column count and the column names of the DataFrame and the case class must match in order to use as[ValuePerNumber] on the DataFrame directly, without extracting values row by row. Two changes are needed:

  1. size is not available in the DataFrame, so it is added with withColumn (see the REPL session below).
  2. The column names in the DataFrame and the field names in the case class did not match, so they are modified to agree.
scala> :paste
// Entering paste mode (ctrl-D to finish)

val someDF = Seq(("202003101750", "202003101700",122),("202003101800", "202003101700",12),("202003101750", "202003101700",42))
.toDF("number", "word","value")
.withColumn("size",lit(0)) // Added this to match your case class columns


// Exiting paste mode, now interpreting.

someDF: org.apache.spark.sql.DataFrame = [number: string, word: string ... 2 more fields]

scala> case class ValuePerNumber(number:String, word:String, value:Int, size: Long=0) // Modified column names to match your dataframe column names.
defined class ValuePerNumber

scala> someDF.as[ValuePerNumber].show(false)
+------------+------------+-----+----+
|number      |word        |value|size|
+------------+------------+-----+----+
|202003101750|202003101700|122  |0   |
|202003101800|202003101700|12   |0   |
|202003101750|202003101700|42   |0   |
+------------+------------+-----+----+


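Alternatively, you can keep the question's original case class (num / wrd / defaultID / size) and align the DataFrame to it with a single select. A minimal sketch, assuming a SparkSession named spark is in scope:

import org.apache.spark.sql.functions.lit
import spark.implicits._ // assumed SparkSession named spark

case class ValuePerNumber(num: String, wrd: String, defaultID: Int, size: Long = 0)

val result: Seq[ValuePerNumber] = someDF.select(
  $"number".as("num"),             // rename columns to the case class field names
  $"word".as("wrd"),
  lit(0).as("defaultID"),          // defaultID is not in the DataFrame, so add it as a constant
  $"value".cast("long").as("size") // value becomes size, widened to Long
).as[ValuePerNumber].collect().toSeq

This produces exactly the expected output, with defaultID fixed at 0 and the original value column in size.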

You can create a Dataset[ValuePerNumber] and collect it as a Seq:

val someDF = Seq(
  ("202003101750", "202003101700", 122),
  ("202003101800", "202003101700", 12),
  ("202003101750", "202003101700", 42)
).toDF("number", "word", "value")

val result = someDF.map(r => ValuePerNumber(r.getAs[String]("number"), r.getAs[String]("word"), 0, r.getAs[Int]("value"))).collect().toSeq
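Here defaultID is passed explicitly as 0 and the value column lands in size, matching the expected output; using getAs with column names rather than positional indexes also keeps the mapping stable if the column order ever changes.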

You can also add the missing column to the DataFrame and rename its columns to match the case class fields, after which you can directly do:

val x = someDF.as[ValuePerNumber].collect()
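A minimal sketch of that approach, using the question's original case class and again assuming a SparkSession named spark is in scope:

import org.apache.spark.sql.functions.lit
import spark.implicits._ // assumed SparkSession named spark

val x = someDF
  .withColumnRenamed("number", "num") // rename DataFrame columns to the case class field names
  .withColumnRenamed("word", "wrd")
  .withColumnRenamed("value", "size")
  .withColumn("defaultID", lit(0))    // add the column missing from the DataFrame
  .as[ValuePerNumber]
  .collect()
  .toSeq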
