
Converting a Spark DataFrame to a Scala Map collection

I'm trying to transform a Spark DataFrame into a Scala Map whose values are lists of the row values.

It is best illustrated as follows:

val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
|  21|Michael|
+----+-------+

I want to turn this into a Scala collection (a Map whose values are Lists of Maps), represented like this:

Map(
  (0 -> List(Map("age" -> null, "name" -> "Michael"), Map("age" -> 21, "name" -> "Michael"))),
  (1 -> Map("age" -> 30, "name" -> "Andy")),
  (2 -> Map("age" -> 19, "name" -> "Justin"))
)

As I don't know much about Scala, I wonder whether this is possible. The inner collection doesn't necessarily have to be a List.

The data structure you want is actually not very useful. Let me explain what I mean by asking two questions:

    1. What is the purpose of the integer keys of the outer map? Are they indices? If so, what is the logic behind them, and why not just use an Array?
    2. Why use Map[String, Any] and access elements unsafely, when you can model the rows as case classes?
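To make the second question concrete, here is a minimal sketch in plain Scala (no Spark required, so it can be pasted into a REPL). The `Person` class, its field types, the sample values, and the `ageOf` helper are assumptions made for illustration.

```scala
// Untyped: every read needs a cast, and a wrong key or type fails only at runtime.
val untyped: Map[String, Any] = Map("age" -> 30L, "name" -> "Andy")
val ageFromMap: Long = untyped("age").asInstanceOf[Long]

// Typed: field names and types are checked by the compiler.
case class Person(name: String, age: Option[Long])
val andy = Person("Andy", Some(30L))
val michael = Person("Michael", None)

// Hypothetical helper: a missing age is handled explicitly instead of being null.
def ageOf(p: Person): String = p.age.map(_.toString).getOrElse("unknown")
```

With the untyped map, a typo such as `untyped("agee")` compiles and then throws at runtime, while `andy.agee` would be rejected by the compiler.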

So I think the best thing you can do would be this:

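import sqlContext.implicits._   // provides the Encoder that df.as[Person] needs
// age is Option[Long]: Spark infers JSON integers as bigint, and the column contains a null
case class Person(name: String, age: Option[Long])
val persons = df.as[Person].collect()
val personsByName: Map[String, Array[Person]] = persons.groupBy(_.name)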

Result would be:

Map(
  Michael -> Array(Person(Michael, None), Person(Michael, Some(21))),
  Andy -> Array(Person(Andy, Some(30))),
  Justin -> Array(Person(Justin, Some(19)))
)
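As a rough sketch of how such a grouped result could be queried, the following builds the same rows by hand so that it runs without Spark. The sample data copies the table from the question; the `Option[Long]` field type and the `agesOf` helper are assumptions for illustration.

```scala
case class Person(name: String, age: Option[Long])

// Same rows as the table in the question, built by hand instead of collected from a DataFrame
val persons = Array(
  Person("Michael", None),
  Person("Andy", Some(30L)),
  Person("Justin", Some(19L)),
  Person("Michael", Some(21L))
)
val personsByName: Map[String, Array[Person]] = persons.groupBy(_.name)

// Hypothetical helper: the known ages for a name; empty if the name is absent
def agesOf(name: String): List[Long] =
  personsByName.getOrElse(name, Array.empty[Person]).toList.flatMap(_.age)
```

Looking up by name is a direct key access here, and the `Option` makes the missing age of the first Michael explicit rather than a null.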

But still, if you insist on the data structure, this is the code you need to use:

val result: Map[Int, List[Map[String, Any]]] =
  persons.groupBy(_.name)       // group persons by name
    .zipWithIndex               // pair each (name, group) entry with an index
    .map {
      case ((_, group), index) =>
        // use the index as key; turn each person into a Map (age stays an Option)
        index -> group.map(p => Map("age" -> p.age, "name" -> p.name)).toList
    }
    .toMap                      // needed on Scala 2.13, where zipWithIndex yields an Iterable
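Note that the snippet above stores `age` as an `Option`, whereas the structure in the question shows `null` for a missing age. If the literal shape from the question is required, something like the following sketch could be used. It works on in-memory data so it runs without Spark; `toRawMap` is a hypothetical helper, and assigning indices in order of first appearance of each name is an assumption, since the question does not say how they are chosen.

```scala
case class Person(name: String, age: Option[Long])

val persons = List(
  Person("Michael", None), Person("Andy", Some(30L)),
  Person("Justin", Some(19L)), Person("Michael", Some(21L))
)

// Hypothetical helper: unwrap the Option so that a missing age becomes null
def toRawMap(p: Person): Map[String, Any] =
  Map("age" -> p.age.fold[Any](null)(a => a), "name" -> p.name)

// Assumption: a name's index is the position at which it first appears
val names: List[String] = persons.map(_.name).distinct

val result: Map[Int, List[Map[String, Any]]] =
  names.zipWithIndex.map { case (name, index) =>
    index -> persons.filter(_.name == name).map(toRawMap)
  }.toMap
```

Deriving the indices from a List keeps them deterministic, which is not the case when they come from iterating over a hash-based Map.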

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint this content, please cite the site URL or the original address.
