
Dynamically creating dataframes in Spark Scala

I have a few columns of data coming out of DataFrame 1, in a loop (from different rows). I want to create a DataFrame 2 with all of this row/column data.

Below is sample data; I tried using Seq:

var DF1 = Seq(
  ("11111111", "0101","6573","X1234",12763),
  ("44444444", "0148","8382","Y5678",-2883),
  ("55555555", "0154","5240","Z9011", 8003))

I want to append the 2 rows below to the above Seq dynamically, and then use the final Seq to create a DataFrame.

  ("88888888", "1333","7020","DEF34",500)
  ("99999999", "1333","7020","GHI56",500)

The final Seq or DataFrame should look like this:

   var DF3 = Seq(
      ("11111111", "0101","6573","X1234",12763),
      ("44444444", "0148","8382","Y5678",-2883),
      ("55555555", "0154","5240","Z9011", 8003),
      ("88888888", "1333","7020","DEF34",500),
      ("99999999", "1333","7020","GHI56",500))

I tried the code below using Seq, and created a case class to possibly use as well. The problem is that when a new row is appended, Seq returns a new Seq with the row added rather than updating the original. How do I get an updated Seq with the new row added to it? If not Seq, is ArrayBuffer a good idea to use?

  case class CreateDFTestCaseClass(ACCOUNT_NO: String, LONG_IND: String, SHORT_IND: String,SECURITY_ID: String, QUANTITY: Integer)
  val sparkSession = SparkSession
    .builder()
    .appName("AllocationOneViewTest")
    .master("local")
    .getOrCreate()
  val sc = sparkSession.sparkContext
  import sparkSession.sqlContext.implicits._
  def main(args: Array[String]): Unit = {
    var acctRulesPosDF = Seq(
      ("11111111", "0101","6573","X1234",12763),
      ("44444444", "0148","8382","Y5678",-2883),
      ("55555555", "0154","5240","Z9011", 8003))
    acctRulesPosDF :+ ("88888888", "1333","7020","DEF34",500)
    acctRulesPosDF :+ ("99999999", "1333","7020","GHI56",500)
    var DF3 = acctRulesPosDF.toDF
    DF3.show()
  }
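To answer the ArrayBuffer part of the question directly: yes, a mutable buffer avoids the reassignment problem entirely, because `+=` appends in place. A minimal sketch outside Spark, using the same tuple shape as the question:

```scala
import scala.collection.mutable.ArrayBuffer

// ArrayBuffer is mutable: += appends in place, no reassignment needed
val rows = ArrayBuffer(
  ("11111111", "0101", "6573", "X1234", 12763),
  ("44444444", "0148", "8382", "Y5678", -2883),
  ("55555555", "0154", "5240", "Z9011", 8003))

rows += (("88888888", "1333", "7020", "DEF34", 500))
rows += (("99999999", "1333", "7020", "GHI56", 500))

// Inside a Spark session, rows.toDF then builds the DataFrame as before,
// since the toDF implicits accept any Seq of tuples (ArrayBuffer is a Seq).
println(rows.size)  // 5
```

Note the double parentheses on `+=`: without them the compiler would treat the tuple's elements as multiple arguments to `+=`.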

It's not the most elegant way, but keeping your code as similar to the original as possible: `:+` returns a new Seq, so you just need to assign the result back to your variable.

 var acctRulesPosDF = Seq(
      ("11111111", "0101","6573","X1234",12763),
      ("44444444", "0148","8382","Y5678",-2883),
      ("55555555", "0154","5240","Z9011", 8003))
    acctRulesPosDF = acctRulesPosDF:+ ("88888888", "1333","7020","DEF34",500)
    acctRulesPosDF = acctRulesPosDF:+ ("99999999", "1333","7020","GHI56",500)

Quick example in the spark-shell

scala>  var acctRulesPosDF = Seq(
     |       ("11111111", "0101","6573","X1234",12763),
     |       ("44444444", "0148","8382","Y5678",-2883),
     |       ("55555555", "0154","5240","Z9011", 8003))
acctRulesPosDF: Seq[(String, String, String, String, Int)] = List((11111111,0101,6573,X1234,12763), (44444444,0148,8382,Y5678,-2883), (55555555,0154,5240,Z9011,8003))

scala>     acctRulesPosDF = acctRulesPosDF:+ ("88888888", "1333","7020","DEF34",500)
acctRulesPosDF: Seq[(String, String, String, String, Int)] = List((11111111,0101,6573,X1234,12763), (44444444,0148,8382,Y5678,-2883), (55555555,0154,5240,Z9011,8003), (88888888,1333,7020,DEF34,500))

scala>     acctRulesPosDF = acctRulesPosDF:+ ("99999999", "1333","7020","GHI56",500)
acctRulesPosDF: Seq[(String, String, String, String, Int)] = List((11111111,0101,6573,X1234,12763), (44444444,0148,8382,Y5678,-2883), (55555555,0154,5240,Z9011,8003), (88888888,1333,7020,DEF34,500), (99999999,1333,7020,GHI56,500))

scala> var DF3 = acctRulesPosDF.toDF
DF3: org.apache.spark.sql.DataFrame = [_1: string, _2: string ... 3 more fields]

scala>     DF3.show()
+--------+----+----+-----+-----+
|      _1|  _2|  _3|   _4|   _5|
+--------+----+----+-----+-----+
|11111111|0101|6573|X1234|12763|
|44444444|0148|8382|Y5678|-2883|
|55555555|0154|5240|Z9011| 8003|
|88888888|1333|7020|DEF34|  500|
|99999999|1333|7020|GHI56|  500|
+--------+----+----+-----+-----+
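One follow-up on the output above: a no-argument `toDF` gives the default column names `_1`.. `_5`. If you want the names from the case class defined in the question, you can map the tuples into it first (or simply pass the names to `toDF`). A sketch, assuming the same session and implicits as in the question:

```scala
// Map the tuples into the question's case class so columns get real names.
case class CreateDFTestCaseClass(ACCOUNT_NO: String, LONG_IND: String,
                                 SHORT_IND: String, SECURITY_ID: String, QUANTITY: Integer)

val acctRulesPos = Seq(
  ("11111111", "0101", "6573", "X1234", 12763),
  ("44444444", "0148", "8382", "Y5678", -2883))

val typed = acctRulesPos.map { case (acct, long, short, secId, qty) =>
  CreateDFTestCaseClass(acct, long, short, secId, qty)
}
// Inside a Spark session: typed.toDF.show() prints ACCOUNT_NO, LONG_IND, ...
// Or, skipping the case class entirely:
//   acctRulesPos.toDF("ACCOUNT_NO", "LONG_IND", "SHORT_IND", "SECURITY_ID", "QUANTITY")
```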

The reason you get the same old Seq even after appending new rows is that the Seq imported by default is scala.collection.immutable.Seq, whose `:+` returns a new collection and never modifies the original. To append in place you need a mutable collection such as scala.collection.mutable.ArrayBuffer (note that plain scala.collection.mutable.Seq supports in-place updates of existing elements but not appends), or do as suggested by @SCouto in the other answer and assign the result of `:+` back to the variable.
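The immutable behaviour is easy to see without Spark at all: `:+` leaves the original untouched and returns a fresh Seq.

```scala
val s1 = Seq(1, 2, 3)
val s2 = s1 :+ 4   // :+ builds and returns a NEW Seq

println(s1)  // List(1, 2, 3) — the original is unchanged
println(s2)  // List(1, 2, 3, 4)
```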

