
Spark 2 iterating over a partition to create a new partition

I have been scratching my head trying to come up with a way to reduce a dataframe in Spark to a frame that records the gaps in the dataframe, preferably without completely killing parallelism. Here is a much-simplified example (it's a bit lengthy because I wanted it to be runnable):

import org.apache.spark.sql.SparkSession

case class Record(typ: String, start: Int, end: Int);

object Sample {
    def main(argv: Array[String]): Unit = {
        val sparkSession = SparkSession.builder()
            .master("local")
            .getOrCreate();

        val df = sparkSession.createDataFrame(
            Seq(
                Record("One", 0, 5),
                Record("One", 10, 15),
                Record("One", 5, 8),
                Record("Two", 10, 25),
                Record("Two", 40, 45),
                Record("Three", 30, 35)
            )
        );

        df.repartition(df("typ")).sortWithinPartitions(df("start")).show();
    }
}

When I am done I would like to be able to output a dataframe like this:

typ   start    end
---   -----    ---
One   0        8
One   10       15
Two   10       25
Two   40       45
Three 30       35

I guessed that partitioning by the 'typ' value would give me partitions with one distinct value each (one-to-one); e.g. in the sample I would end up with three partitions, one each for 'One', 'Two' and 'Three'. Furthermore, the sortWithinPartitions call is intended to give me each partition sorted on 'start', so that I can iterate from beginning to end and record the gaps. That last part is where I am stuck. Is this possible? If not, is there another approach?
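
For concreteness, this is roughly the per-partition iteration I have in mind, as a rough sketch only (it assumes mapPartitions over a Dataset[Record]; note that repartition(df("typ")) only guarantees that rows with the same typ land in the same partition, not one partition per value, so the scan still has to track the typ of each row):

import sparkSession.implicits._

val merged = df.as[Record]
    .repartition($"typ")
    .sortWithinPartitions($"typ", $"start")
    .mapPartitions { rows =>
        // scan each sorted partition, extending the current interval while rows overlap
        rows.foldLeft(List.empty[Record]) {
            case (prev :: done, r) if r.typ == prev.typ && r.start <= prev.end =>
                prev.copy(end = prev.end max r.end) :: done  // overlap: extend the open interval
            case (done, r) =>
                r :: done                                    // gap (or new typ): open a new interval
        }.reverseIterator
    }

merged.show()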

I propose to skip the repartitioning and sorting steps, and jump directly to a distributed compressed merge sort (I've just invented the name for the algorithm, just like the algorithm itself).

Here is the part of the algorithm that is meant to be used as the reduce operation:

  type Gap = (Int, Int)

  def mergeIntervals(as: List[Gap], bs: List[Gap]): List[Gap] = {
    require(!as.isEmpty, "as must be non-empty")
    require(!bs.isEmpty, "bs must be non-empty")

    @annotation.tailrec
    def mergeRec(
      gaps: List[Gap],
      gapStart: Int,
      gapEndAccum: Int,
      as: List[Gap],
      bs: List[Gap]
    ): List[Gap] = {
      as match {
        case Nil => {
          bs match {
            case Nil => (gapStart, gapEndAccum) :: gaps
            case notEmpty => mergeRec(gaps, gapStart, gapEndAccum, bs, Nil)
          }
        }
        case (a0, a1) :: at => {
          if (a0 <= gapEndAccum) {
            mergeRec(gaps, gapStart, gapEndAccum max a1, at, bs)
          } else {
            bs match {
              case Nil => mergeRec((gapStart, gapEndAccum) :: gaps, a0, gapEndAccum max a1, at, bs)
              case (b0, b1) :: bt => if (b0 <= gapEndAccum) {
                mergeRec(gaps, gapStart, gapEndAccum max b1, as, bt)
              } else {
                if (a0 < b0) {
                  mergeRec((gapStart, gapEndAccum) :: gaps, a0, a1, at, bs)
                } else {
                  mergeRec((gapStart, gapEndAccum) :: gaps, b0, b1, as, bt)
                }
              }
            }
          }
        }
      }
    }
    val (a0, a1) :: at = as
    val (b0, b1) :: bt = bs

    val reverseRes = 
      if (a0 < b0) 
        mergeRec(Nil, a0, a1, at, bs)
      else
        mergeRec(Nil, b0, b1, as, bt)

    reverseRes.reverse
  }

It works as follows:

  println(mergeIntervals(
    List((0, 3), (4, 7), (9, 11), (15, 16), (18, 22)),
    List((1, 2), (4, 5), (6, 10), (12, 13), (15, 17))
  ))

  // Outputs:
  // List((0,3), (4,11), (12,13), (15,17), (18,22))

Now, if you combine it with Spark's parallel reduce,

  val mergedIntervals = df.
    as[(String, Int, Int)].
    rdd.
    map{case (t, s, e) => (t, List((s, e)))}.              // Convert start end to list with one interval
    reduceByKey(mergeIntervals).                           // perform parallel compressed merge-sort
    flatMap{ case (k, vs) => vs.map(v => (k, v._1, v._2))}.// explode resulting lists of merged intervals
    toDF("typ", "start", "end")                            // convert back to DF

  mergedIntervals.show()

you obtain something like a parallel merge sort that works directly on compressed representations of integer sequences (thus the name).
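
For example, the reduction for the "One" key of the sample data can be reproduced locally; one possible reduction order is shown below, but any order gives the same result because the operation is commutative and associative:

  val oneKey = List(List((0, 5)), List((10, 15)), List((5, 8)))
  println(oneKey.reduce(mergeIntervals))

  // Outputs:
  // List((0,8), (10,15))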

The result:

+-----+-----+---+
|  typ|start|end|
+-----+-----+---+
|  Two|   10| 25|
|  Two|   40| 45|
|  One|    0|  8|
|  One|   10| 15|
|Three|   30| 35|
+-----+-----+---+

Discussion

The mergeIntervals method implements a commutative, associative operation for merging lists of non-overlapping intervals that are already sorted in increasing order. All overlapping intervals are merged, and the result is again stored in increasing order. This procedure can be repeated in a reduce step until all interval sequences are merged.
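
A quick local sanity check of the two properties that reduceByKey relies on (not part of the original code, just an illustration):

  val xs = List((0, 3), (9, 11))
  val ys = List((4, 7), (15, 16))
  val zs = List((1, 2), (6, 10))

  assert(mergeIntervals(xs, ys) == mergeIntervals(ys, xs))                 // commutative
  assert(mergeIntervals(mergeIntervals(xs, ys), zs) ==
         mergeIntervals(xs, mergeIntervals(ys, zs)))                       // associative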

The interesting property of the algorithm is that it maximally compresses every intermediate result of the reduction. Thus, if you have many intervals with a lot of overlap, this algorithm might actually be faster than other algorithms that are based on sorting the input intervals.
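
For instance, a heavily overlapping input collapses to a single interval as soon as it is merged, so the intermediate lists stay small:

  println(mergeIntervals(
    List((0, 100)),
    List((1, 2), (5, 6), (10, 20), (30, 40), (50, 60))
  ))

  // Outputs:
  // List((0,100))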

However, if you have a great many intervals that overlap only rarely, then this method might run out of memory and not work at all, so other algorithms must be used that first sort the intervals and then scan and merge adjacent intervals locally. So whether this will work depends on the use case.
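
For that case, one DataFrame-based variant of the "sort, then scan and merge" idea is the standard gaps-and-islands formulation with window functions. The sketch below is not part of the answer above (the names byTyp and mergedViaWindows are illustrative, and Window.unboundedPreceding assumes Spark 2.1+), but it should produce the same merged intervals for the sample data:

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._

  // Sort each typ by start, track the running maximum end seen so far, and start a
  // new "island" whenever the current start exceeds it; then aggregate per island.
  val byTyp = Window.partitionBy("typ").orderBy("start")

  val mergedViaWindows = df
    .withColumn("maxEndSoFar",
      max("end").over(byTyp.rowsBetween(Window.unboundedPreceding, -1)))
    .withColumn("newIsland",
      when(col("start") > col("maxEndSoFar"), 1).otherwise(0)) // first row per typ gets 0, which is fine
    .withColumn("islandId", sum("newIsland").over(byTyp))
    .groupBy("typ", "islandId")
    .agg(min("start").as("start"), max("end").as("end"))
    .drop("islandId")

  mergedViaWindows.show()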


Full code

  // assumes a SparkSession in scope and `import spark.implicits._`
  // (needed for `.toDF` on the Seq, `.as[...]`, and `.toDF` on the RDD)
  val df = Seq(
    ("One", 0, 5),
    ("One", 10, 15),
    ("One", 5, 8),
    ("Two", 10, 25),
    ("Two", 40, 45),
    ("Three", 30, 35)
  ).toDF("typ", "start", "end")

  type Gap = (Int, Int)
  /** The `merge`-step of a variant of merge-sort
    * that works directly on compressed sequences of integers,
    * where instead of individual integers, the sequence is 
    * represented by sorted, non-overlapping ranges of integers.
    */
  def mergeIntervals(as: List[Gap], bs: List[Gap]): List[Gap] = {
    require(!as.isEmpty, "as must be non-empty")
    require(!bs.isEmpty, "bs must be non-empty")
    // assuming that `as` and `bs` both are either lists with a single
    // interval, or sorted lists that arise as output of
    // this method, recursively merges them into a single list of
    // gaps, merging all overlapping gaps.
    @annotation.tailrec
    def mergeRec(
      gaps: List[Gap],
      gapStart: Int,
      gapEndAccum: Int,
      as: List[Gap],
      bs: List[Gap]
    ): List[Gap] = {
      as match {
        case Nil => {
          bs match {
            case Nil => (gapStart, gapEndAccum) :: gaps
            case notEmpty => mergeRec(gaps, gapStart, gapEndAccum, bs, Nil)
          }
        }
        case (a0, a1) :: at => {
          if (a0 <= gapEndAccum) {
            mergeRec(gaps, gapStart, gapEndAccum max a1, at, bs)
          } else {
            bs match {
              case Nil => mergeRec((gapStart, gapEndAccum) :: gaps, a0, gapEndAccum max a1, at, bs)
              case (b0, b1) :: bt => if (b0 <= gapEndAccum) {
                mergeRec(gaps, gapStart, gapEndAccum max b1, as, bt)
              } else {
                if (a0 < b0) {
                  mergeRec((gapStart, gapEndAccum) :: gaps, a0, a1, at, bs)
                } else {
                  mergeRec((gapStart, gapEndAccum) :: gaps, b0, b1, as, bt)
                }
              }
            }
          }
        }
      }
    }
    val (a0, a1) :: at = as
    val (b0, b1) :: bt = bs

    val reverseRes = 
      if (a0 < b0) 
        mergeRec(Nil, a0, a1, at, bs)
      else
        mergeRec(Nil, b0, b1, as, bt)

    reverseRes.reverse
  }


  val mergedIntervals = df.
    as[(String, Int, Int)].
    rdd.
    map{case (t, s, e) => (t, List((s, e)))}.              // Convert start end to list with one interval
    reduceByKey(mergeIntervals).                           // perform parallel compressed merge-sort
    flatMap{ case (k, vs) => vs.map(v => (k, v._1, v._2))}.// explode resulting lists of merged intervals
    toDF("typ", "start", "end")                            // convert back to DF

  mergedIntervals.show()

Testing

The implementation of mergeIntervals is only lightly tested here. If you want to actually incorporate it into your codebase, here is at least a sketch of a repeated randomized test for it:

  def randomIntervalSequence(): List[Gap] = {
    def recHelper(acc: List[Gap], open: Option[Int], currIdx: Int): List[Gap] = {
      if (math.random > 0.999) acc.reverse
      else {
        if (math.random > 0.90) {
          if (open.isEmpty) {
            recHelper(acc, Some(currIdx), currIdx + 1)
          } else {
            recHelper((open.get, currIdx) :: acc, None, currIdx + 1)
          }
        } else {
          recHelper(acc, open, currIdx + 1)
        }
      }
    }
    recHelper(Nil, None, 0)
  }

  def intervalsToInts(is: List[Gap]): List[Int] = is.flatMap{ case (a, b) => a to b }

  var numNonTrivialTests = 0
  while(numNonTrivialTests < 1000) {
    val as = randomIntervalSequence()
    val bs = randomIntervalSequence()
    if (!as.isEmpty && !bs.isEmpty) {
      numNonTrivialTests += 1
      val merged = mergeIntervals(as, bs)
      assert((intervalsToInts(as).toSet ++ intervalsToInts(bs)) == intervalsToInts(merged).toSet)
    }
  }

You would obviously have to replace the raw assert with something more civilized, depending on your framework.
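
For example, with ScalaCheck (assuming that is the framework in use, and simply reusing randomIntervalSequence from above as a generator; the object and property names below are illustrative):

  import org.scalacheck.{Gen, Properties}
  import org.scalacheck.Prop.forAll

  object MergeIntervalsSpec extends Properties("mergeIntervals") {
    // wraps the ad-hoc generator above in a ScalaCheck Gen, re-sampling on each run
    val genIntervals: Gen[List[Gap]] =
      Gen.delay(Gen.const(randomIntervalSequence())).suchThat(_.nonEmpty)

    property("covers exactly the union of both inputs") =
      forAll(genIntervals, genIntervals) { (as, bs) =>
        val merged = mergeIntervals(as, bs)
        (intervalsToInts(as).toSet ++ intervalsToInts(bs)) == intervalsToInts(merged).toSet
      }
  }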
