Converting a Spark DataFrame to a Scala Map of collections

Question

I am trying to convert a Spark DataFrame into a Scala Map whose values are lists.

Ideally, given a DataFrame like this:

val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
|  21|Michael|
+----+-------+

I would like to get a Scala collection (Map of Maps(List(values))) such as the following:

Map(
  (0 -> List(Map("age" -> null, "name" -> "Michael"), Map("age" -> 21, "name" -> "Michael"))),
  (1 -> Map("age" -> 30, "name" -> "Andy")),
  (2 -> Map("age" -> 19, "name" -> "Justin"))
)

Since I don't know Scala well, I'd like to know whether this approach is feasible. It doesn't necessarily have to be a list.

Answer 1

The data structure you are asking for is not really useful. Let me explain what I mean by asking two questions:

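1. What is the purpose of the integer keys in the outer map? Are they indices? If so, what is the logic behind them, and why not simply use an Array?
2. Why use Map[String, Any] with unsafe element access when you could model the rows as a case class?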
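
The second point can be illustrated with a small sketch in plain Scala (no Spark needed, pasteable into the REPL). The sample values and the names `row`, `person`, `ageFromMap` and `ageFromClass` are assumptions made for illustration; `Person` is a stand-in for the case class used in this answer, with `age` typed as `Long` because Spark infers JSON integers as bigint.

```scala
// With Map[String, Any], every read needs a cast the compiler cannot check.
val row: Map[String, Any] = Map("name" -> "Andy", "age" -> 30L)
// Throws at runtime if the key is missing or the value is not a Long.
val ageFromMap: Long = row("age").asInstanceOf[Long]

// With a case class, field names and types are checked at compile time.
case class Person(name: String, age: Option[Long])
val person = Person("Andy", Some(30L))
// A missing age has to be handled explicitly instead of surfacing as null.
val ageFromClass: Long = person.age.getOrElse(-1L)
```

Misspelling a key such as `row("agee")` only fails when the code runs, whereas `person.agee` is rejected by the compiler.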

So I think the best you can do is:

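import sqlContext.implicits._  // provides the Encoder required by df.as[Person]

// Spark infers JSON integers as bigint, so age is Option[Long] rather than Option[Int]
case class Person(name: String, age: Option[Long])
val persons = df.as[Person].collect
val personsByName: Map[String, Array[Person]] = persons.groupBy(_.name)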

The result will be:

Map(
  Michael -> Array(Person(Michael, None), Person(Michael, Some(21))),
  Andy -> Array(Person(Andy, Some(30))),
  Justin -> Array(Person(Justin, Some(19)))
)
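
As a rough sketch of how the grouped map might be used, the snippet below rebuilds the same four rows locally instead of collecting them from Spark, so it runs in a plain Scala REPL. The names `michaels`, `knownAges` and `nobody` are illustrative assumptions, not part of the original answer.

```scala
case class Person(name: String, age: Option[Long])

// Stand-in for df.as[Person].collect: the same four rows, built locally.
val persons: Array[Person] = Array(
  Person("Michael", None),
  Person("Andy", Some(30L)),
  Person("Justin", Some(19L)),
  Person("Michael", Some(21L))
)

val personsByName: Map[String, Array[Person]] = persons.groupBy(_.name)

// getOrElse avoids an exception when the name is absent.
val michaels: Array[Person] = personsByName.getOrElse("Michael", Array.empty[Person])

// flatMap over Option keeps only the rows that actually have an age.
val knownAges: List[Long] = michaels.toList.flatMap(_.age)

// Map#get returns None for an unknown key.
val nobody: Option[Array[Person]] = personsByName.get("Nobody")
```

Because `age` is an `Option`, the row with a missing age is skipped without any null checks.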

However, if you insist on that data structure, this is the code you would need:

val result: Map[Int, List[Map[String, Any]]] =
  persons.groupBy(_.name)       // group persons by name
    .zipWithIndex               // pair each group with an index
    .map {
      case ((name, group), index) =>
        // use the index as key; convert each person to the desired map
        index -> group.map(p => Map[String, Any]("age" -> p.age, "name" -> p.name)).toList
    }
    .toMap                      // needed on Scala 2.13, where zipWithIndex on a Map yields an Iterable
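
One caveat with numbering the groups this way: the iteration order of a Scala `Map` is not specified, so which name receives which index is arbitrary. If stable indices matter, one possible variant (a sketch, not part of the original answer) is to sort the groups before numbering them. The local `persons` array and the name `indexed` are assumptions for illustration, and no Spark is required to run it.

```scala
case class Person(name: String, age: Option[Long])

// Stand-in for the collected rows.
val persons: Array[Person] = Array(
  Person("Michael", None),
  Person("Andy", Some(30L)),
  Person("Justin", Some(19L)),
  Person("Michael", Some(21L))
)

val indexed: Map[Int, List[Map[String, Any]]] =
  persons.groupBy(_.name)
    .toSeq
    .sortBy(_._1)               // fix the order (alphabetical by name) before numbering
    .zipWithIndex
    .map { case ((_, group), index) =>
      // age stays an Option here, as in the code above
      index -> group.map(p => Map[String, Any]("age" -> p.age, "name" -> p.name)).toList
    }
    .toMap
```

With alphabetical sorting, Andy gets index 0, Justin index 1, and the two Michael rows share index 2.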

Solution 1
0 Accepted 2022-07-08 10:07:16