Spark reduce and aggregate on same data-set

I have a text file which I read and then split using the split operation. This results in an RDD whose rows are arrays of the form Array(A, B, C, D, E, F, G, H, I).

I would like to find max(F) - min(G) for every key E (reduce by key E). Then I want to sum the resulting values by key C and append that sum to every row with the same key.

For example:

+--+--+--+--+
| C| E| F| G|
+--+--+--+--+
|en| 1| 3| 1|
|en| 1| 4| 0|
|nl| 2| 1| 1|
|nl| 2| 5| 2|
|nl| 3| 9| 3|
|nl| 3| 6| 4|
|en| 4| 9| 1|
|en| 4| 2| 1|
+--+--+--+--+

This should result in:

+--+--+-------------+---+
| C| E|max(F)-min(G)|sum|
+--+--+-------------+---+
|en| 1| 4           |12 |
|nl| 2| 4           |10 |
|nl| 3| 6           |10 |
|en| 4| 8           |12 |
+--+--+-------------+---+

What would be the best way to tackle this? Currently I am trying to compute max(F) - min(G) by running:

// max(F) and min(G) per key E: index 4 is E, index 5 is F, index 6 is G
val maxCounts = logEntries.map(line => (line(4), line(5).toLong)).reduceByKey((x, y) => math.max(x, y))
val minCounts = logEntries.map(line => (line(4), line(6).toLong)).reduceByKey((x, y) => math.min(x, y))

// join on E and take the difference
val maxMinCounts = maxCounts.join(minCounts).map { case (id, (maxF, minG)) => (id, maxF - minG) }

And then joining the resulting RDDs. However, this becomes tricky when I also want to sum these values and append them to my existing data set.
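For instance, I can recover which C each E belongs to and join that mapping back in with something like the sketch below (assuming logEntries is an RDD[Array[String]] laid out as above), but it feels convoluted:

// recover which C each E belongs to (index 2 is C, index 4 is E)
val eToC = logEntries.map(line => (line(4), line(2))).distinct()

val withSums = maxMinCounts
  .join(eToC)                                    // (E, (diff, C))
  .map { case (e, (diff, c)) => (c, (e, diff)) }
  .groupByKey()                                  // all (E, diff) pairs per C
  .flatMap { case (c, pairs) =>
    val total = pairs.map(_._2).sum              // per-C sum of the diffs
    pairs.map { case (e, diff) => (c, e, diff, total) }
  }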

I would love to hear any suggestions!

This kind of logic is easily implemented in the DataFrame API as well. But you need to explicitly form your columns from the array:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, min, sum}
import spark.implicits._   // for .toDF and the 'symbol column syntax; assumes a SparkSession named spark

val window = Window.partitionBy('C)

val df = rdd
  .map { case Array(_, _, c, _, e, f, g, _, _) => (c, e, f.toInt, g.toInt) }   // F and G as numbers, not strings
  .toDF("C", "E", "F", "G")
  .groupBy('C, 'E)
  .agg((max('F) - min('G)).as("diff"))
  .withColumn("sum", sum('diff).over(window))
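On the sample data above this should produce the desired values (row order may vary):

df.show()
// +---+---+----+---+
// |  C|  E|diff|sum|
// +---+---+----+---+
// | en|  1|   4| 12|
// | en|  4|   8| 12|
// | nl|  2|   4| 10|
// | nl|  3|   6| 10|
// +---+---+----+---+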

Assuming, like your sample data, that unique E's never span multiple C's... you could do something like this.

import math.{max, min}

// accumulator for one (C, E) group: keeps the running max of F and min of G
case class FG(f: Int, g: Int) {
  def combine(that: FG) =
    FG(max(f, that.f), min(g, that.g))
  def result = f - g
}
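For instance, folding the two E = 1 rows from the sample data:

val merged = FG(3, 1) combine FG(4, 0)   // FG(4, 0): running max of F, min of G
merged.result                            // 4, i.e. max(F) - min(G) for E = 1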

val result = {
  rdd
    .map { case Array(_, _, c, _, e, f, g, _, _) =>
      ((c, e), FG(f.toInt, g.toInt)) }   // .toInt assuming the split yields strings
    .reduceByKey(_ combine _)            // one FG per (C, E) group
    .map { case ((c, _), fg) =>
      (c, fg.result) }                   // (C, max(F) - min(G))
    .reduceByKey(_ + _)                  // sum the diffs per C
}
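Note that result only carries (C, sum) and drops E and the per-group diff. A hedged variant that keeps all four columns of the desired output, by materializing the per-(C, E) results and joining the per-C sums back in:

val perKey = rdd
  .map { case Array(_, _, c, _, e, f, g, _, _) =>
    ((c, e), FG(f.toInt, g.toInt)) }
  .reduceByKey(_ combine _)                          // ((C, E), FG)

val sums = perKey
  .map { case ((c, _), fg) => (c, fg.result) }
  .reduceByKey(_ + _)                                // (C, sum)

val full = perKey
  .map { case ((c, e), fg) => (c, (e, fg.result)) }
  .join(sums)                                        // (C, ((E, diff), sum))
  .map { case (c, ((e, diff), sum)) => (c, e, diff, sum) }

Caching perKey (perKey.cache()) would avoid recomputing it for both branches.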
