
Scala - Spark Dataframe - Convert rows to Map variable

I have a Spark DataFrame:

Level    Hierarchy   Code
--------------------------
Level1  Hier1        1
Level1  Hier2        2
Level1  Hier3        3
Level1  Hier4        4
Level1  Hier5        5
Level2  Hier1        1
Level2  Hier2        2
Level2  Hier3        3

I need to convert this to a Map variable like Map[String, Map[Int, String]], i.e.:

Map["Level1", Map[1->"Hier1", 2->"Hier2", 3->"Hier3", 4->"Hier4", 5->"Hier5"]]
Map["Level2", Map[1->"Hier1", 2->"Hier2", 3->"Hier3"]]

Please suggest a suitable approach to achieve this.

My attempt follows. It works, but it's ugly:

val level_code_df = master_df.select("Level", "Hierarchy", "Code").distinct()
val hierarchy_names = level_code_df.select("Level").distinct().collect()

var hierarchyMap: scala.collection.mutable.Map[String, scala.collection.mutable.Map[Int, String]] =
  scala.collection.mutable.Map()

for (i <- 0 until hierarchy_names.size) {
  println("names:" + hierarchy_names(i)(0))
  val name = hierarchy_names(i)(0).toString

  // One full pass over the DataFrame per level: keep rows matching this level,
  // map each to a single-entry Map, and merge the maps with reduce.
  val code_level_map = level_code_df.rdd.map { row =>
    if (name.equals(row.getAs[String]("Level")))
      Map(row.getAs[String]("Code").toInt -> row.getAs[String]("Hierarchy"))
    else
      Map[Int, String]()
  }.reduce(_ ++ _)

  hierarchyMap += (name -> (collection.mutable.Map() ++ code_level_map))
}
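For comparison, the same result can be built with a single collect plus a local groupBy. This is a minimal sketch, assuming the distinct (Level, Hierarchy, Code) rows fit in driver memory; it keeps the getAs[String]("Code").toInt assumption from the attempt above (i.e. that Code is a string column):

// Collect the distinct rows once, then group them locally on the driver
val hierarchyMap: Map[String, Map[Int, String]] =
  master_df.select("Level", "Hierarchy", "Code").distinct()
    .collect()
    .groupBy(_.getAs[String]("Level"))
    .map { case (level, rows) =>
      level -> rows.map(r => r.getAs[String]("Code").toInt -> r.getAs[String]("Hierarchy")).toMap
    }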

You need to use groupByKey followed by mapGroups on a Dataset. Don't forget to also include a Kryo map encoder:

import spark.implicits._   // needed for .toDS

case class Data(level: String, hierarchy: String, code: Int)

val data = Seq(
  Data("Level1", "Hier1", 1),
  Data("Level1", "Hier2", 2),
  Data("Level1", "Hier3", 3),
  Data("Level1", "Hier4", 4),
  Data("Level1", "Hier5", 5),
  Data("Level2", "Hier1", 1),
  Data("Level2", "Hier2", 2),
  Data("Level2", "Hier3", 3)).toDS

// Encoder for the Map produced by mapGroups below
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Map[Int, String]]]

Spark 2.0+:

data.groupByKey(_.level).mapGroups {
  case (level, values) => Map(level -> values.map(v => (v.code, v.hierarchy)).toMap)
}.collect()
//Array[Map[String,Map[Int,String]]] = Array(Map(Level1 -> Map(5 -> Hier5, 1 -> Hier1, 2 -> Hier2, 3 -> Hier3, 4 -> Hier4)), Map(Level2 -> Map(1 -> Hier1, 2 -> Hier2, 3 -> Hier3)))

Spark 1.6+:

data.rdd.groupBy(_.level).map {
  case (level, values) => Map(level -> values.map(v => (v.code, v.hierarchy)).toMap)
}.collect()
//Array[Map[String,Map[Int,String]]] = Array(Map(Level2 -> Map(1 -> Hier1, 2 -> Hier2, 3 -> Hier3)), Map(Level1 -> Map(5 -> Hier5, 1 -> Hier1, 2 -> Hier2, 3 -> Hier3, 4 -> Hier4)))
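Note that both snippets return an Array of single-entry maps rather than the single Map[String, Map[Int, String]] the question asks for. One way to get the latter (a sketch, reusing the Kryo encoder above) is to merge the collected maps on the driver:

val hierarchyMap: Map[String, Map[Int, String]] =
  data.groupByKey(_.level).mapGroups {
    case (level, values) => Map(level -> values.map(v => (v.code, v.hierarchy)).toMap)
  }.collect()
   .reduce(_ ++ _)  // merge the per-level single-entry maps into one Map on the driver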

@prudenko's answer is probably the most concise, and should work with Spark 1.6 or later. But if you're looking for a solution that stays within the DataFrame API (and not Datasets), here's one using a simple UDF:

import scala.collection.mutable
import org.apache.spark.sql.functions.{collect_list, map, udf}
import spark.implicits._   // for the $ column syntax

// Merge the per-row single-entry maps collected for each Level into one map
val mapCombiner = udf[Map[Int, String], mutable.WrappedArray[Map[Int, String]]] { _.reduce(_ ++ _) }

val result: Map[String, Map[Int, String]] = df
  .groupBy("Level")
  .agg(collect_list(map($"Code", $"Hierarchy")) as "Maps")
  .select($"Level", mapCombiner($"Maps") as "Combined")
  .rdd.map(r => (r.getAs[String]("Level"), r.getAs[Map[Int, String]]("Combined")))
  .collectAsMap()
  .toMap   // collectAsMap returns scala.collection.Map; convert to an immutable Map

NOTICE that this will perform badly (or OOM) if there can be thousands of distinct values for a single key (i.e. per value of Level). But since you're collecting everything into driver memory anyway, either that won't be an issue or your requirement won't work regardless.
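As a side note not from the original answers: on Spark 2.4+ the UDF can be replaced with the built-in map_from_entries over a collect_list of structs. A sketch, assuming Code can be cast to int:

import org.apache.spark.sql.functions.{collect_list, map_from_entries, struct}

val result: Map[String, Map[Int, String]] = df
  .groupBy("Level")
  // collect (Code, Hierarchy) pairs per Level and build one map per group
  .agg(map_from_entries(collect_list(struct($"Code".cast("int"), $"Hierarchy"))) as "Combined")
  .rdd.map(r => (r.getAs[String]("Level"), r.getAs[Map[Int, String]]("Combined")))
  .collectAsMap()
  .toMap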
