
How to Transform a Spark Scala Nested Map within a Map Data Structure?

I want to transform a DataFrame into a nested data structure consisting of a Map inside another Map, held in an array of a Scala case class.

The result should transform this dataframe:

+-----+-------+----------+----+
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
|  123|    ITA|1475600500|18.0|
|  123|    ITA|1475600516|19.0|
+-----+-------+----------+----+

into:

+----------------------------------------------------------------------+
|value                                                                 |
+----------------------------------------------------------------------+
|[{"value":123,"attributes":{"ITA":{"1475600500":18,"1475600516":19}}}]|
+----------------------------------------------------------------------+

The actualResult dataset below gets me close, but the structure isn't quite the same as the expected output.

case class Record(value: Integer, attributes: Map[String, Map[String, BigDecimal]])

// requires `import spark.implicits._` in scope for the Dataset encoder
val actualResult = df
  .map(r =>
    Array(
      Record(
        r.getAs[Int]("Value"),
        Map(
          r.getAs[String]("Country") ->
            Map(
              // Timestamp is inferred as an integer column, so convert it to a
              // String key; scala.math.BigDecimal has no String constructor,
              // so use the companion's apply instead of `new`
              r.getAs[Int]("Timestamp").toString -> BigDecimal(
                r.getAs[Double]("Sum").toString
              )
            )
        )
      )
    )
  )
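
For reference, the JSON renderings shown in this question can be produced with toJSON (an assumption about how the output below was generated, not stated in the original post):

// Render each Array[Record] row as a JSON string for inspection
actualResult.toJSON.show(false)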

The Timestamp values in the actualResult dataset don't get combined into the same Record row; because map runs once per input row, each row produces its own Record:

+------------------------------------------------------+
|value                                                 |
+------------------------------------------------------+
|[{"value":123,"attributes":{"ITA":{"1475600516":19}}}]|
|[{"value":123,"attributes":{"ITA":{"1475600500":18}}}]|
+------------------------------------------------------+
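
One way to fold those rows into a single Record, not shown in the original post, is to group by the key columns and merge the per-timestamp maps with groupByKey/mapGroups. A minimal sketch, assuming the same df (with Timestamp inferred as an integer column), the Record case class above, and spark.implicits._ in scope:

// Group by (Value, Country) and merge each group's timestamps into one Map.
val merged = df
  .groupByKey(r => (r.getAs[Int]("Value"), r.getAs[String]("Country")))
  .mapGroups { case ((value, country), rows) =>
    // All rows of one group are visible here, so both timestamps
    // land in the same inner Map instead of producing two Records.
    val perTimestamp = rows.map { r =>
      r.getAs[Int]("Timestamp").toString -> BigDecimal(r.getAs[Double]("Sum").toString)
    }.toMap
    Array(Record(value, Map(country -> perTimestamp)))
  }

merged.toJSON.show(false)
// expected shape (decimal formatting may differ):
// [{"value":123,"attributes":{"ITA":{"1475600500":18.0,"1475600516":19.0}}}]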

Using groupBy and collect_list, with struct to build a combined column, I was able to get a single row, as in the output below.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._

// Build the sample as a Dataset[String]; spark.read.csv accepts a
// Dataset[String] directly since Spark 2.2.
// (On Scala 2.13 / JDK 11+ use linesIterator instead of lines.)
val mycsv =
  """
    |Value|Country|Timestamp|Sum
    |  123|ITA|1475600500|18.0
    |  123|ITA|1475600516|19.0
  """.stripMargin('|').lines.toList.toDS()

val df: DataFrame = spark.read.option("header", true)
  .option("sep", "|")
  .option("inferSchema", true)
  .csv(mycsv)
df.show

// Collect one struct per input row, grouped by (Value, Country)
val df1 = df
  .groupBy("Value", "Country")
  .agg(collect_list(struct(col("Country"), col("Timestamp"), col("Sum"))).alias("attributes"))
  .drop("Country")

val json = df1.toJSON // you can save this to a file
json.show(false)

The result: df.show prints the parsed input, and the JSON output combines the two rows into one:

+-----+-------+----------+----+
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
|123.0|ITA    |1475600500|18.0|
|123.0|ITA    |1475600516|19.0|
+-----+-------+----------+----+

+--------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------+
|{"Value":123.0,"attributes":[{"Country":"ITA","Timestamp":1475600500,"Sum":18.0},{"Country":"ITA","Timestamp":1475600516,"Sum":19.0}]}  |
+--------------------------------------------------------------------------------------------------------------------------------------+
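
Note that collect_list of structs yields an array of structs, not the nested Map[String, Map[String, BigDecimal]] the question asks for. If real map columns are needed, map_from_entries (available since Spark 2.4) can build them from the collected structs; a minimal sketch, not from the original post, again assuming the same df:

// map_from_entries (Spark 2.4+) turns an array<struct<key,value>> into a map.
// First level: one map per (Value, Country) keyed by Timestamp (as a string);
// second level: one map per Value keyed by Country.
val nested = df
  .groupBy("Value", "Country")
  .agg(
    map_from_entries(
      collect_list(struct(col("Timestamp").cast("string"), col("Sum")))
    ).alias("perTimestamp")
  )
  .groupBy("Value")
  .agg(
    map_from_entries(
      collect_list(struct(col("Country"), col("perTimestamp")))
    ).alias("attributes")
  )

nested.toJSON.show(false)
// expected shape: {"Value":123.0,"attributes":{"ITA":{"1475600500":18.0,"1475600516":19.0}}}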
