
Split RDD of JSON-s by size in Scala

Suppose we have a lot of JSON-s in HDFS, but for a prototype we load some JSON-s locally into Spark with:

val eachJson = sc.textFile("JSON_Folder/*.json")

I want to write a job which goes through the eachJson RDD[String] and calculates the size of each JSON. The size is then added to an accumulator and the corresponding JSON is appended to a StringBuilder. But when the size of the concatenated JSON-s exceeds a threshold, we start storing the subsequent JSON-s in a new StringBuilder.

For instance, if we have 100 JSON-s and we start to calculate their sizes one by one, we may observe that from the 32nd element the size of the concatenated JSON-s exceeds the threshold, so we group together only the first 31 JSON-s. After that we start again from the 32nd element.

What I have managed to do so far is to obtain the indexes where we have to split the RDD, based on the following code:

eachJson.collect()
  .map(_.getBytes("UTF-8").length)
  .scanLeft(0){_ + _}
  .takeWhile(_ < 20000) //threshold = 20000
  .length-1
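
(For reference, a rough driver-side extension of this idea that computes every split index rather than only the first one might look like the following sketch; the 20000-byte threshold is kept from above:)

val sizes = eachJson.collect().map(_.getBytes("UTF-8").length)
val splitIndexes = sizes.zipWithIndex.foldLeft((0, List.empty[Int])) {
  case ((running, cuts), (size, idx)) =>
    if (running + size > 20000) (size, idx :: cuts) // idx starts a new group
    else (running + size, cuts)
}._2.reverse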

I also tried:

val accum = sc.accumulator(0, "My Accumulator")
val buf = new StringBuilder
while(accum.value < 20000)
  {
    for(i <- eachJson)
      {
        accum.add(i.getBytes("UTF-8").length)
        buf ++= i
      }    
  }

But I receive the following error: org.apache.spark.SparkException: Task not serializable.

How can I do this in Spark via Scala? I use Spark 1.6.0 and Scala 2.10.6.

Not an answer; just to point you in the right direction. You get the "Task not serializable" exception because your val buf = new StringBuilder is used inside the RDD's foreach (for(i <- eachJson)). Spark cannot distribute your buf variable, as StringBuilder itself is not serializable. Besides, you shouldn't access mutable state directly. So the recommendation is to put all the data you need into the Accumulator, not just the sizes:

case class MyAccumulator(size: Int, result: String)

And use something like rdd.aggregate or rdd.fold:

eachJson.fold(MyAccumulator(0, ""))(...)

//or

eachJson.fold(List.empty[MyAccumulator])(...)

Or just use it with scanLeft, as you collect anyway.
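
As a rough illustration, with such a case class the rdd.aggregate variant could look something like the sketch below (it only accumulates the total size and the concatenated result, without any threshold/splitting logic):

val combined = eachJson.aggregate(MyAccumulator(0, ""))(
  (acc, json) => MyAccumulator(acc.size + json.getBytes("UTF-8").length, acc.result + json),
  (a, b) => MyAccumulator(a.size + b.size, a.result + b.result)
)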

Be aware that this won't be scalable (same as the StringBuilder / collect solution). In order to make it scalable, use mapPartitions.


Update. mapPartitions would give you the ability to partially aggregate your JSONs, as you get a "local" iterator (one partition) as your input, which you can operate on as a regular Scala collection. It might be enough if you are OK with some small percentage of the JSONs not being concatenated.

 eachJson.mapPartitions{ localCollection =>
    ... //compression logic here
 }

Spark's programming model is not ideal for what you are trying to achieve, if we take the general problem of "aggregating elements depending on something that can only be known by inspecting previous elements", for two reasons:

  1. Spark does not, generally speaking, impose an ordering over the data (but it can do so).
  2. Spark deals with data in partitions, and the sizes of the partitions are not usually (e.g. by default) dependent on the contents of the data, but are decided by a default partitioner whose role is to divide the data evenly into partitions.

So it's not really a question of whether it is possible (it is); it is rather a question of how much it costs (CPU / memory / time) for what it buys you.

A draft for an exact solution

If I were to shoot for an exact solution (by exact, I mean: preserving the element order, defined by e.g. a timestamp in the JSONs, and grouping exactly consecutive inputs up to the largest amount that approaches the boundary), I would:

  1. Impose an ordering on the RDD (there is a sortBy function, which does that): this is a full data shuffle, so it IS expensive.
  2. Give each row an id, after the sort (there is an RDD version of zipWithIndex which respects the ordering of the RDD, if it exists; there is also a faster dataframe equivalent that creates monotonically increasing indexes, albeit non-consecutive ones).
  3. Collect the fraction of the result that is necessary to calculate the size boundaries (the boundaries being the ids defined at step 2), pretty much as you did. This again is a full pass over the data. (A rough sketch of steps 1 to 3 is given right after this list.)
  4. Create a partitioner of the data that respects these boundaries (e.g. make sure that all the elements of a single group are in the same partition), and apply this partitioner to the RDD obtained at step 2 (another full shuffle of the data). You just got yourself partitions that are logically equivalent to what you expect, e.g. groups of elements whose sum of sizes is under a certain limit. But the ordering inside each partition may have been lost in the repartitioning process. So you are not done yet!
  5. Then I would mapPartitions on this result to:
    5.1. re-sort the data locally within each partition,
    5.2. group the items into the data structure I need once sorted.
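
A rough sketch of steps 1 to 3, assuming the ordering key is a timestamp inside each JSON and that an extractTimestamp helper exists (both are assumptions for illustration), could look like this:

// Step 1: impose an ordering (full shuffle)
val sorted = eachJson.sortBy(json => extractTimestamp(json))

// Step 2: give each row an id that respects the ordering
val indexed = sorted.zipWithIndex() // RDD[(String, Long)]

// Step 3: collect the (id, size) pairs and walk them to find the boundary ids
val boundaries = indexed
  .map { case (json, id) => (id, json.getBytes("UTF-8").length) }
  .collect()
  .sortBy(_._1)
  .foldLeft((0L, List.empty[Long])) { case ((running, cuts), (id, size)) =>
    if (running + size > 20000) (size.toLong, id :: cuts)
    else (running + size, cuts)
  }._2.reverse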

One of the keys is not to apply anything that messes with the partitions between steps 4 and 5. As long as the "partition map" fits into the driver's memory, this is almost a practical solution, but a very costly one.

A simpler version (with relaxed constraints)

If it is OK for groups not to reach an optimal size, then the solution becomes much simpler (and it respects the ordering of the RDD if you have set one): it is pretty much what you would code if there were no Spark at all, just an Iterator of JSON files.

Personally, I'd define a recursive accumulator function (nothing Spark related) like so (I guess you could write a shorter, more efficient version using takeWhile):

  import scala.annotation.tailrec

  val MAX_SIZE = 20000 // the size threshold (20000 in the question)

  /**
    * Aggregate recursively the contents of an iterator into a Seq[Seq[]]
    * @param remainingJSONs the remaining original JSON contents to be aggregated
    * @param currentAccSize the size of the active accumulation
    * @param currentAcc the current aggregation of json strings
    * @param resultAccumulation the result of aggregated JSON strings
    */
  @tailrec
  def acc(remainingJSONs: Iterator[String], currentAccSize: Int, currentAcc: Seq[String], resultAccumulation: Seq[Seq[String]]): Seq[Seq[String]] = {
    // IF there is nothing more in the current partition
    if (remainingJSONs.isEmpty) {
      // And we're not in the process of accumulating
      if (currentAccSize == 0)
        // Then return what was accumulated before
        resultAccumulation
      else
        // Return what was accumulated before, and what was in the process of being accumulated
        resultAccumulation :+ currentAcc
    } else {
      // We still have JSON items to process
      val itemToAggregate = remainingJSONs.next()
      // Is this item too large for the current accumulation ?
      if (currentAccSize + itemToAggregate.size > MAX_SIZE) {
        // Finish the current aggregation, and proceed with a fresh one
        acc(remainingJSONs, itemToAggregate.size, Seq(itemToAggregate), resultAccumulation :+ currentAcc)
      } else {
        // Accumulate the current item on top of the current aggregation
        acc(remainingJSONs, currentAccSize + itemToAggregate.size, currentAcc :+ itemToAggregate, resultAccumulation)
      }
    }
  }

Now you take this accumulating code, and make it run for each partition of the Spark RDD:

val jsonRDD = ...
val groupedJSONs = jsonRDD.mapPartitions(aPartition => {
  acc(aPartition, 0, Seq(), Seq()).iterator
})

This will turn your RDD[String] into an RDD[Seq[String]], where each Seq[String] is made of consecutive RDD elements (which may be predictable if the RDD has been sorted, and may not be otherwise), whose total length is below the threshold. What may be "sub-optimal" is that, at the end of each partition, there may be a Seq[String] with just a few (possibly a single) JSONs, while at the beginning of the following partition a full one was created.
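
If the goal is still one concatenated string per group (as with the StringBuilder in the question), each resulting Seq[String] can then simply be joined, for example:

val concatenatedJSONs = groupedJSONs.map(_.mkString)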
