
What is the order guarantee when joining two columns of a Spark dataframe which are processed separately?

I have a dataframe with 3 columns:

  1. date
  2. jsonString1
  3. jsonString2

I want to expand the attributes inside each JSON string into columns, so I did something like this:

 val json1 = spark.read.json(dataframe.select(col("jsonString1")).rdd.map(_.getString(0)))
 val json2 = spark.read.json(dataframe.select(col("jsonString2")).rdd.map(_.getString(0)))

 val json1Table = json1.selectExpr("id", "status")
 val json2Table = json2.selectExpr("name", "address")

Now I want to join these tables back together, so I did the following:


     val json1TableWithIndex = addColumnIndex(json1Table)
     val json2TableWithIndex = addColumnIndex(json2Table)
     val finalResult = json1TableWithIndex
            .join(json2TableWithIndex, Seq("columnindex"))
            .drop("columnindex")

    // zipWithIndex assigns each row an index based on its current partition order
    def addColumnIndex(df: DataFrame) = spark.createDataFrame(
        df.rdd.zipWithIndex.map { case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex) },
        StructType(df.schema.fields :+ StructField("columnindex", LongType, false))
    )

After sampling a few rows, I observe that the rows match exactly as in the source dataframe. However, I did not find any information on the order guarantee when joining two columns of a dataframe which are processed separately. Is this the right way to solve my problem? Any help is appreciated.

It is always risky to rely on undocumented behaviour: your code may not work as you intend, because you only have a partial understanding of what the engine is doing.

You can do the same thing in a much more efficient way without any split-and-join approach. Use the from_json function to parse each JSON string into a nested struct column, then flatten the structs into top-level columns, and finally drop the intermediate JSON string columns and struct columns.

Here is an example of the whole process.

import org.apache.spark.sql.functions.{from_json, col}
import org.apache.spark.sql.types.{StringType, StructType, StructField}

val df = (Seq(
  ("09-02-2020", """{"id":"01", "status":"Active"}""", """{"name":"Abdullah", "address":"Jumeirah"}"""),
  ("10-02-2020", """{"id":"02", "status":"Dormant"}""", """{"name":"Ali", "address":"Jebel Ali"}""")
).toDF("date", "jsonString1", "jsonString2"))

scala> df.show()
+----------+--------------------+--------------------+
|      date|         jsonString1|         jsonString2|
+----------+--------------------+--------------------+
|09-02-2020|{"id":"01", "stat...|{"name":"Abdullah...|
|10-02-2020|{"id":"02", "stat...|{"name":"Ali", "a...|
+----------+--------------------+--------------------+

val schema1 = (StructType(Seq(
  StructField("id", StringType, true), 
  StructField("status", StringType, true)
)))

val schema2 = (StructType(Seq(
  StructField("name", StringType, true), 
  StructField("address", StringType, true)
)))


val dfFlattened = (df.withColumn("jsonData1", from_json(col("jsonString1"), schema1))
            .withColumn("jsonData2", from_json(col("jsonString2"), schema2))
            .withColumn("id", col("jsonData1.id"))
            .withColumn("status", col("jsonData1.status"))
            .withColumn("name", col("jsonData2.name"))
            .withColumn("address", col("jsonData2.address"))
            .drop("jsonString1", "jsonString2", "jsonData1", "jsonData2"))

scala> dfFlattened.show()
+----------+---+-------+--------+---------+
|      date| id| status|    name|  address|
+----------+---+-------+--------+---------+
|09-02-2020| 01| Active|Abdullah| Jumeirah|
|10-02-2020| 02|Dormant|     Ali|Jebel Ali|
+----------+---+-------+--------+---------+   
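If the JSON payloads have many fields, writing one withColumn per field gets tedious. A more concise variant of the same technique expands every field of each parsed struct at once with star expansion (`jsonData1.*`). This is a sketch assuming the same df, schema1, and schema2 as above; the local SparkSession is only needed to make it self-contained outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, col}
import org.apache.spark.sql.types.{StringType, StructType, StructField}

// Hypothetical local session for a standalone run; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("flatten-json").getOrCreate()
import spark.implicits._

val schema1 = StructType(Seq(
  StructField("id", StringType, true),
  StructField("status", StringType, true)))
val schema2 = StructType(Seq(
  StructField("name", StringType, true),
  StructField("address", StringType, true)))

val df = Seq(
  ("09-02-2020", """{"id":"01", "status":"Active"}""", """{"name":"Abdullah", "address":"Jumeirah"}"""),
  ("10-02-2020", """{"id":"02", "status":"Dormant"}""", """{"name":"Ali", "address":"Jebel Ali"}""")
).toDF("date", "jsonString1", "jsonString2")

// Parse each JSON string once, then pull every struct field up with `.*`
val dfFlattened = df
  .withColumn("jsonData1", from_json(col("jsonString1"), schema1))
  .withColumn("jsonData2", from_json(col("jsonString2"), schema2))
  .select(col("date"), col("jsonData1.*"), col("jsonData2.*"))
```

Because the flattening happens in a single select, there is no need to drop the intermediate columns afterwards, and each row's fields stay together by construction, so no ordering assumption is involved.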
