Spark reduce and aggregate on same data-set

I have a text file which I read and then split using the split operation. This results in an RDD whose rows are arrays of the form Array(A, B, C, D, E, F, G, H, I).

I would like to find max(F) - min(G) for every key E (reduce by key E). Then I want to sum the resulting values by key C and append that sum to every row with the same key.

For example:

+--+--+--+--+
| C| E| F| G|
+--+--+--+--+
|en| 1| 3| 1|
|en| 1| 4| 0|
|nl| 2| 1| 1|
|nl| 2| 5| 2|
|nl| 3| 9| 3|
|nl| 3| 6| 4|
|en| 4| 9| 1|
|en| 4| 2| 1|
+--+--+--+--+

This should result in:

+--+--+-------------+---+
| C| E|max(F)-min(G)|sum|
+--+--+-------------+---+
|en| 1| 4           |12 |
|nl| 2| 4           |10 |
|nl| 3| 6           |10 |
|en| 4| 8           |12 |
+--+--+-------------+---+

What would be the best way to tackle this? Currently I am trying to compute max(F) - min(G) by running:

// max(F) and min(G) per key E: index 4 is E, index 5 is F, index 6 is G
val maxCounts = logEntries.map(line => (line(4), line(5).toLong)).reduceByKey((x, y) => math.max(x, y))
val minCounts = logEntries.map(line => (line(4), line(6).toLong)).reduceByKey((x, y) => math.min(x, y))

// join on E and take the difference
val maxMinCounts = maxCounts.join(minCounts).map { case (id, (maxF, minG)) => (id, maxF - minG) }

And then joining the resulting RDDs. However, this becomes tricky when I also want to sum these values and append them to my existing data set.
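For instance, I can recover which C each E belongs to and join that mapping back in with something like the sketch below (assuming logEntries is an RDD[Array[String]] laid out as above), but it feels convoluted:

// recover which C each E belongs to (index 2 is C, index 4 is E)
val eToC = logEntries.map(line => (line(4), line(2))).distinct()

val withSums = maxMinCounts
  .join(eToC)                                    // (E, (diff, C))
  .map { case (e, (diff, c)) => (c, (e, diff)) }
  .groupByKey()                                  // all (E, diff) pairs per C
  .flatMap { case (c, pairs) =>
    val total = pairs.map(_._2).sum              // per-C sum of the diffs
    pairs.map { case (e, diff) => (c, e, diff, total) }
  }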

I would love to hear any suggestions!

This kind of logic is easily implemented in the DataFrame API as well. But you need to explicitly form your columns from the array:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, min, sum}
import spark.implicits._   // for .toDF and the 'symbol column syntax; assumes a SparkSession named spark

val window = Window.partitionBy('C)

val df = rdd
  .map { case Array(_, _, c, _, e, f, g, _, _) => (c, e, f.toInt, g.toInt) }   // F and G as numbers, not strings
  .toDF("C", "E", "F", "G")
  .groupBy('C, 'E)
  .agg((max('F) - min('G)).as("diff"))
  .withColumn("sum", sum('diff).over(window))
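On the sample data above this should produce the desired values (row order may vary):

df.show()
// +---+---+----+---+
// |  C|  E|diff|sum|
// +---+---+----+---+
// | en|  1|   4| 12|
// | en|  4|   8| 12|
// | nl|  2|   4| 10|
// | nl|  3|   6| 10|
// +---+---+----+---+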

Assuming, like your sample data, that unique E's never span multiple C's... you could do something like this.

import math.{max, min}

// accumulator for one (C, E) group: keeps the running max of F and min of G
case class FG(f: Int, g: Int) {
  def combine(that: FG) =
    FG(max(f, that.f), min(g, that.g))
  def result = f - g
}
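For instance, folding the two E = 1 rows from the sample data:

val merged = FG(3, 1) combine FG(4, 0)   // FG(4, 0): running max of F, min of G
merged.result                            // 4, i.e. max(F) - min(G) for E = 1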

val result = {
  rdd
    .map { case Array(_, _, c, _, e, f, g, _, _) =>
      ((c, e), FG(f.toInt, g.toInt)) }   // .toInt assuming the split yields strings
    .reduceByKey(_ combine _)            // one FG per (C, E) group
    .map { case ((c, _), fg) =>
      (c, fg.result) }                   // (C, max(F) - min(G))
    .reduceByKey(_ + _)                  // sum the diffs per C
}
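Note that result only carries (C, sum) and drops E and the per-group diff. A hedged variant that keeps all four columns of the desired output, by materializing the per-(C, E) results and joining the per-C sums back in:

val perKey = rdd
  .map { case Array(_, _, c, _, e, f, g, _, _) =>
    ((c, e), FG(f.toInt, g.toInt)) }
  .reduceByKey(_ combine _)                          // ((C, E), FG)

val sums = perKey
  .map { case ((c, _), fg) => (c, fg.result) }
  .reduceByKey(_ + _)                                // (C, sum)

val full = perKey
  .map { case ((c, e), fg) => (c, (e, fg.result)) }
  .join(sums)                                        // (C, ((E, diff), sum))
  .map { case (c, ((e, diff), sum)) => (c, e, diff, sum) }

Caching perKey (perKey.cache()) would avoid recomputing it for both branches.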
