
What is the order guarantee when joining two columns of a Spark dataframe which are processed separately?

I have a dataframe with 3 columns:

  1. date
  2. jsonString1
  3. jsonString2

I want to expand the attributes inside each JSON string into columns, so I did something like this:

 val json1 = spark.read.json(dataframe.select(col("jsonString1")).rdd.map(_.getString(0)))
 val json2 = spark.read.json(dataframe.select(col("jsonString2")).rdd.map(_.getString(0)))

 val json1Table = json1.selectExpr("id", "status")
 val json2Table = json2.selectExpr("name", "address")

Now I want to join these tables back together, so I did the following:


     val json1TableWithIndex = addColumnIndex(json1Table)
     val json2TableWithIndex = addColumnIndex(json2Table)
     val finalResult = json1TableWithIndex
            .join(json2TableWithIndex, Seq("columnindex"))
            .drop("columnindex")

    // zipWithIndex assigns each row an index based on its current partition order
    def addColumnIndex(df: DataFrame) = spark.createDataFrame(
        df.rdd.zipWithIndex.map { case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex) },
        StructType(df.schema.fields :+ StructField("columnindex", LongType, false))
    )

After sampling a few rows, I observe that the rows match exactly as in the source dataframe. However, I did not find any information on the order guarantee when joining two columns of a dataframe which are processed separately. Is this the right way to solve my problem? Any help is appreciated.

It is always risky to rely on undocumented behaviour: your code may not work as you intend, because you only have a partial understanding of what the engine is doing.

You can do the same thing in a much more efficient way without any split-and-join approach. Use the from_json function to parse each JSON string into a nested struct column, then flatten the structs into top-level columns, and finally drop the intermediate JSON string columns and struct columns.

Here is an example of the whole process.

import org.apache.spark.sql.functions.{from_json, col}
import org.apache.spark.sql.types.{StringType, StructType, StructField}

val df = (Seq(
  ("09-02-2020", """{"id":"01", "status":"Active"}""", """{"name":"Abdullah", "address":"Jumeirah"}"""),
  ("10-02-2020", """{"id":"02", "status":"Dormant"}""", """{"name":"Ali", "address":"Jebel Ali"}""")
).toDF("date", "jsonString1", "jsonString2"))

scala> df.show()
+----------+--------------------+--------------------+
|      date|         jsonString1|         jsonString2|
+----------+--------------------+--------------------+
|09-02-2020|{"id":"01", "stat...|{"name":"Abdullah...|
|10-02-2020|{"id":"02", "stat...|{"name":"Ali", "a...|
+----------+--------------------+--------------------+

val schema1 = (StructType(Seq(
  StructField("id", StringType, true), 
  StructField("status", StringType, true)
)))

val schema2 = (StructType(Seq(
  StructField("name", StringType, true), 
  StructField("address", StringType, true)
)))


val dfFlattened = (df.withColumn("jsonData1", from_json(col("jsonString1"), schema1))
            .withColumn("jsonData2", from_json(col("jsonString2"), schema2))
            .withColumn("id", col("jsonData1.id"))
            .withColumn("status", col("jsonData1.status"))
            .withColumn("name", col("jsonData2.name"))
            .withColumn("address", col("jsonData2.address"))
            .drop("jsonString1", "jsonString2", "jsonData1", "jsonData2"))

scala> dfFlattened.show()
+----------+---+-------+--------+---------+
|      date| id| status|    name|  address|
+----------+---+-------+--------+---------+
|09-02-2020| 01| Active|Abdullah| Jumeirah|
|10-02-2020| 02|Dormant|     Ali|Jebel Ali|
+----------+---+-------+--------+---------+   
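If the JSON payloads have many fields, writing one withColumn per field gets tedious. A more concise variant of the same technique expands every field of each parsed struct at once with star expansion (`jsonData1.*`). This is a sketch assuming the same df, schema1, and schema2 as above; the local SparkSession is only needed to make it self-contained outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, col}
import org.apache.spark.sql.types.{StringType, StructType, StructField}

// Hypothetical local session for a standalone run; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("flatten-json").getOrCreate()
import spark.implicits._

val schema1 = StructType(Seq(
  StructField("id", StringType, true),
  StructField("status", StringType, true)))
val schema2 = StructType(Seq(
  StructField("name", StringType, true),
  StructField("address", StringType, true)))

val df = Seq(
  ("09-02-2020", """{"id":"01", "status":"Active"}""", """{"name":"Abdullah", "address":"Jumeirah"}"""),
  ("10-02-2020", """{"id":"02", "status":"Dormant"}""", """{"name":"Ali", "address":"Jebel Ali"}""")
).toDF("date", "jsonString1", "jsonString2")

// Parse each JSON string once, then pull every struct field up with `.*`
val dfFlattened = df
  .withColumn("jsonData1", from_json(col("jsonString1"), schema1))
  .withColumn("jsonData2", from_json(col("jsonString2"), schema2))
  .select(col("date"), col("jsonData1.*"), col("jsonData2.*"))
```

Because the flattening happens in a single select, there is no need to drop the intermediate columns afterwards, and each row's fields stay together by construction, so no ordering assumption is involved.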
