
How to convert a nested struct into a nested map for a Spark DataFrame

I am trying to batch-write into AWS DynamoDB, and I have to reformat the DataFrame before loading. My question is: how can I convert a deeply nested StructType DataFrame into the deeply nested Map format that DynamoDB recognizes, without having to define the fields one by one manually?

Environment: Apache Spark 2.4.3 (also Spark 2.4.3 on Databricks), Scala 2.11, DynamoDB

The source has a deeply nested structure, as shown below:

root
 |-- PK: string (nullable = false)
 |-- SK: string (nullable = false)
 |-- ee: struct (nullable = false)
 |    |-- kv: struct (nullable = false)
 |    |    |-- ss: map (nullable = true)
 |    |    |-- pp: struct (nullable = true)
 |    |    |    |-- gg: string (nullable = true)
 |    |    |    |-- nn: struct (nullable = true)
 |    |    |    |    |-- mm: string (nullable = true)
 |    |    |-- ll: array (nullable = true)
 |    |    |    |-- le: struct (containsNull = true)
 |    |    |    |    |-- lep: struct (nullable = true)

I found some samples, but they typically handle only one or two levels of nesting, while my DataFrame is "deeper" than that.

The function below will flatten a deeply nested DataFrame of any depth.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

// TODO: Instead of the while/for loops, this could also be written with pattern matching.
def getFlattenDF(dataFrame: DataFrame): DataFrame = {
  var df = dataFrame
  var flag = true
  while (flag) {
    // Expand one level of nesting per pass.
    for ((name, types) <- df.dtypes) {
      if (types.startsWith("Array"))
        // arrays: one row per element
        df = df.withColumn(name, explode_outer(col(name)))
      else if (types.startsWith("Map"))
        // maps: exploded into generic key/value columns, one row per entry
        df = df.selectExpr("*", s"explode_outer($name)").drop(name)
      else if (types.startsWith("Struct"))
        // structs: one column per field, prefixed with the parent column name
        df = df.selectExpr(Array("*") ++ df.select(s"$name.*").columns.map(s => s"$name.$s as ${name}_$s"): _*).drop(name)
    }
    // Keep looping until no Array, Struct, or Map columns remain.
    flag = false
    for ((name, types) <- df.dtypes) {
      if (types.startsWith("Array") || types.startsWith("Struct") || types.startsWith("Map"))
        flag = true
    }
  }
  df
}

 val df = //input dataframe

 getFlattenDF(df)
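
Applied to a DataFrame with the schema above, the loop keeps rewriting until no Array, Map, or Struct columns remain: struct fields become parent-prefixed columns, the map ss is exploded into generic key/value columns, and the array ll is exploded into one row per element. Roughly (exact column names and types depend on the real data), the flattened schema looks like:

root
 |-- PK: string
 |-- SK: string
 |-- key: string          (from exploding the map ee_kv_ss)
 |-- value: ...
 |-- ee_kv_pp_gg: string
 |-- ee_kv_pp_nn_mm: string
 |-- ee_kv_ll_lep_...: ...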

Here's my final solution:

/** Convert from JSON to a Java map, then wrap it with the AttributeValue class. */
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import com.fasterxml.jackson.core.`type`.TypeReference
import com.fasterxml.jackson.databind.ObjectMapper
import java.util.{Map => JMap}
val tempMap = new ObjectMapper().readValue(testStringText, new TypeReference[JMap[String, Object]](){})
val testMap = toAttribute(tempMap)
new AttributeValue().withM(testMap.getM())
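
The snippet above assumes testStringText already holds one row's nested data as a JSON string. One way to obtain it (an assumption for illustration, not from the original post) is to serialize the DataFrame rows with toJSON:

// Assumption: each DataFrame row maps to one DynamoDB item; df is the nested DataFrame.
val jsonRows: Array[String] = df.toJSON.collect()   // every nested row as a JSON string
val testStringText: String = jsonRows.head          // convert one row at a time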

/** Nested conversion - recursively convert Jackson's Java types into AttributeValue wrappers. */
import java.util.{Collection => JCollection}
import scala.collection.JavaConverters._

def toAttribute(m: Any): AttributeValue = {
    m match {
      // JSON objects are deserialized by Jackson as LinkedHashMap
      case sm: java.util.LinkedHashMap[_, _] => {
        new AttributeValue().withM(sm.asScala.map(kv => (kv._1.toString, toAttribute(kv._2))).asJava)
      }
      // JSON arrays are deserialized as ArrayList
      case sl: java.util.ArrayList[_] => {
        new AttributeValue().withL(sl.asScala.map(item => toAttribute(item)).asJava.asInstanceOf[JCollection[AttributeValue]])
      }
      case st: String => new AttributeValue().withS(st)
      case bol: Boolean => new AttributeValue().withBOOL(bol)
      case dbl: java.lang.Double => new AttributeValue().withN(dbl.toString)
      // note: integers are written as DynamoDB strings here, not numbers
      case int: java.lang.Integer => new AttributeValue().withS(int.toString)
      // anything else (including null) falls through to an empty AttributeValue
      case _ => new AttributeValue()
    }
}
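
To close the loop, here is a minimal sketch of writing the converted structure to DynamoDB with the AWS SDK v1 client. The table name, key attributes, and client setup are assumptions for illustration, not part of the original post.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, PutItemRequest}
import scala.collection.JavaConverters._

val client = AmazonDynamoDBClientBuilder.defaultClient()

// Hypothetical item layout: PK/SK as plain strings, the nested payload under "ee".
val item: java.util.Map[String, AttributeValue] = Map(
  "PK" -> new AttributeValue().withS("some-pk"),
  "SK" -> new AttributeValue().withS("some-sk"),
  "ee" -> new AttributeValue().withM(testMap.getM())
).asJava

client.putItem(new PutItemRequest().withTableName("my-table").withItem(item))

For actual batch writes, the same item maps would be grouped into WriteRequest/PutRequest entries and sent with batchWriteItem instead of putItem.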
