Spark dataframe to nested map
How can I convert a rather small data frame in Spark (max 300 MB) to a nested map in order to improve Spark's DAG? I believe this operation will be quicker than a join later on (Spark dynamic DAG is a lot slower and different from hard coded DAG), as the transformed values were created during the train step of a custom estimator. Now I just want to apply them really quickly during the predict step of the pipeline.
val inputSmall = Seq(
  ("A", 0.3, "B", 0.25),
  ("A", 0.3, "g", 0.4),
  ("d", 0.0, "f", 0.1),
  ("d", 0.0, "d", 0.7),
  ("A", 0.3, "d", 0.7),
  ("d", 0.0, "g", 0.4),
  ("c", 0.2, "B", 0.25)
).toDF("column1", "transformedCol1", "column2", "transformedCol2")
This gives the wrong type of map (an array of per-row Map[String, Any], keyed by column name):
val inputToMap = inputSmall.collect.map(r => Map(inputSmall.columns.zip(r.toSeq):_*))
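To see why, here is a minimal plain-Scala sketch (no Spark needed, values borrowed from the example data): zipping the column names with a single row's values produces one flat Map[String, Any] per row, keyed by column name rather than by column value, so the overall result is an Array of such maps instead of the desired nested map.

```scala
// What the collect.map above builds for one row of the example data.
val columns = Seq("column1", "transformedCol1", "column2", "transformedCol2")
val rowValues = Seq[Any]("A", 0.3, "B", 0.25)

// One flat Map[String, Any] per row -- keys are column NAMES, not column values.
val perRowMap: Map[String, Any] = columns.zip(rowValues).toMap
```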
I would rather want something like:
Map[String, Map[String, Double]]("column1" -> Map("A" -> 0.3, "d" -> 0.0, ...), "column2" -> Map("B" -> 0.25, "g" -> 0.4, ...))
Edit: removed collect operation from final map
If you are using Spark 2+, here's a suggestion:
import org.apache.spark.sql.functions.map  // also requires spark.implicits._ for the $ syntax

val inputToMap = inputSmall.select(
  map($"column1", $"transformedCol1").as("column1"),
  map($"column2", $"transformedCol2").as("column2")
)

val cols = inputToMap.columns
val localData = inputToMap.collect

cols.map { colName =>
  colName -> localData.flatMap(_.getAs[Map[String, Double]](colName)).toMap
}.toMap
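Once collected, the nested map supports constant-time lookups at predict time. A minimal plain-Scala sketch, with the map hard-coded from the example data and transform a hypothetical helper name:

```scala
// Nested map as produced above (hard-coded here for illustration).
val lookup: Map[String, Map[String, Double]] = Map(
  "column1" -> Map("A" -> 0.3, "d" -> 0.0, "c" -> 0.2),
  "column2" -> Map("B" -> 0.25, "g" -> 0.4, "d" -> 0.7, "f" -> 0.1)
)

// Look up the transformed value for a raw value in a given column;
// None if either the column or the value is unknown.
def transform(col: String, value: String): Option[Double] =
  lookup.get(col).flatMap(_.get(value))
```

Inside a Spark job this small map would typically be broadcast and applied via a UDF, so each per-row lookup stays local to the executor.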
I'm not sure I follow the motivation, but I think this is the transformation that would get you the result you're after:
import org.apache.spark.sql.Row

// collect from DF (by your assumption - it is small enough)
val data: Array[Row] = inputSmall.collect()

// Create the "column pairs" -
// can be replaced with hard-coded value: List(("column1", "transformedCol1"), ("column2", "transformedCol2"))
val columnPairs: List[(String, String)] = inputSmall.columns
  .grouped(2)
  .collect { case Array(k, v) => (k, v) }
  .toList

// for each pair, get data and group it by left-column's value, choosing the first match
val result: Map[String, Map[String, Double]] = columnPairs
  .map { case (k, v) => k -> data.map(r => (r.getAs[String](k), r.getAs[Double](v))) }
  .toMap
  .mapValues(pairs => pairs.groupBy(_._1).map { case (_, group) => group.head })

result.foreach(println)
// prints:
// (column1,Map(A -> 0.3, d -> 0.0, c -> 0.2))
// (column2,Map(d -> 0.7, g -> 0.4, f -> 0.1, B -> 0.25))
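The mapValues step above deduplicates repeated keys by keeping the first pair in each group. A plain-Scala sketch of that step in isolation, using values from the example data:

```scala
// Duplicate keys ("A" appears twice in column1) collapse to the first pair seen;
// groupBy preserves encounter order within each group, so head is the first occurrence.
val pairs = Array(("A", 0.3), ("A", 0.3), ("d", 0.0), ("c", 0.2))
val dedup: Map[String, Double] =
  pairs.groupBy(_._1).map { case (_, group) => group.head }
```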