Split RDD of JSON-s by size in Scala
Suppose we have a lot of JSON-s in HDFS, but for a prototype we load some JSON-s locally into Spark with:
val eachJson = sc.textFile("JSON_Folder/*.json")
I want to write a job which goes through the eachJson RDD[String] and calculates the size of each JSON. The size is then added to an accumulator and the corresponding JSON is appended to a StringBuilder. But when the size of the concatenated JSON-s exceeds a threshold, we start storing the subsequent JSON-s in a new StringBuilder.
For instance, if we have 100 JSON-s and we start to calculate their sizes one by one, we may observe that from the 32nd element the size of the concatenated JSON-s exceeds the threshold, so we group together only the first 31 JSON-s. After that we start again from the 32nd element.
What I have managed to do so far is to obtain the index at which we have to split the RDD, based on the following code:
eachJson.collect()
  .map(_.getBytes("UTF-8").length)
  .scanLeft(0){_ + _}
  .takeWhile(_ < 20000) // threshold = 20000
  .length - 1
Also I tried:
val accum = sc.accumulator(0, "My Accumulator")
val buf = new StringBuilder
while(accum.value < 20000)
{
  for(i <- eachJson)
  {
    accum.add(i.getBytes("UTF-8").length)
    buf ++= i
  }
}
But I receive the following error:
org.apache.spark.SparkException: Task not serializable
How can I do this in Spark via Scala? I use Spark 1.6.0 and Scala 2.10.6.
Not an answer; just to point you in the right direction. You get the "Task is not serializable" exception because your val buf = new StringBuilder is used inside the RDD's foreach (for(i <- eachJson)). Spark cannot distribute your buf variable, because StringBuilder itself is not serializable. Besides, you shouldn't access mutable state directly anyway. So the recommendation is to put all the data you need into the accumulator, not just the sizes:
case class MyAccumulator(size: Int, result: String)
And use something like rdd.aggregate or rdd.fold:
eachJson.fold(MyAccumulator(0, ""))(...)
//or
eachJson.fold(List.empty[MyAccumulator])(...)
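For illustration only (not part of the original suggestion), a minimal sketch of the aggregate variant; note that RDD.fold requires the zero value to have the same type as the RDD's elements, so aggregate is the closer fit when the accumulator type differs. This only shows how to carry both size and content in one value, and does not yet implement the size threshold:

val combined = eachJson.aggregate(MyAccumulator(0, ""))(
  // seqOp: fold one JSON string into the partition-local accumulator
  (acc, json) => MyAccumulator(acc.size + json.getBytes("UTF-8").length, acc.result + json),
  // combOp: merge the accumulators coming from different partitions
  (a, b) => MyAccumulator(a.size + b.size, a.result + b.result)
)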
Or just use it with scanLeft as you collect anyway.
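A minimal local sketch of that idea (non-scalable, since it collects everything to the driver; the 20000-byte threshold is assumed from the question): compute a running group index with scanLeft, resetting the running size whenever the next JSON would cross the threshold, and then group consecutive JSONs by that index:

val threshold = 20000
val jsons = eachJson.collect()

// scanLeft carries (runningSizeOfCurrentGroup, groupIndex); the running size resets
// and the group index advances whenever the next JSON would cross the threshold.
val groupIndexes = jsons
  .map(_.getBytes("UTF-8").length)
  .scanLeft((0, 0)) { case ((running, group), size) =>
    if (running + size > threshold) (size, group + 1) else (running + size, group)
  }
  .tail          // drop the (0, 0) seed so indexes align with jsons
  .map(_._2)

// Group consecutive JSONs that ended up with the same group index
val grouped: Array[Array[String]] =
  jsons.zip(groupIndexes).groupBy(_._2).toArray.sortBy(_._1).map(_._2.map(_._1))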
Be aware that this won't be scalable (same as the StringBuilder / collect solution). In order to make it scalable, use mapPartitions.
Update. mapPartitions would give you the ability to partially aggregate your JSONs, since you get a "local" iterator (one partition) as your input, and you can operate on it as a regular Scala collection. It might be enough if you are OK with some small percentage of the JSONs not being concatenated.
eachJson.mapPartitions{ localCollection =>
  ... // compression logic here
}
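For illustration, a sketch of what that compression logic might look like inside a partition, again assuming the 20000-byte threshold from the question (the answer below works through the same idea with a tail-recursive helper):

val groupedPerPartition = eachJson.mapPartitions { localCollection =>
  val threshold = 20000 // assumed from the question
  // Fold the partition's iterator into groups of consecutive JSONs whose
  // combined UTF-8 size stays below the threshold.
  val (closedGroups, lastGroup, _) =
    localCollection.foldLeft((Vector.empty[Seq[String]], Vector.empty[String], 0)) {
      case ((done, current, size), json) =>
        val jsonSize = json.getBytes("UTF-8").length
        if (size + jsonSize > threshold && current.nonEmpty)
          (done :+ current, Vector(json), jsonSize) // close the current group, start a new one
        else
          (done, current :+ json, size + jsonSize)  // keep accumulating
    }
  (if (lastGroup.nonEmpty) closedGroups :+ lastGroup else closedGroups).iterator
}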
Spark's programming model is not ideal for what you are trying to achieve, if we take the general problem of "aggregating elements depending on something that can only be known by inspecting previous elements", for two reasons.
So it's not really a question of whether it is possible (it is); it is rather a question of "how much does it cost" (CPU / memory / time) for what it buys you.
If I were to shoot for an exact solution (by exact, I mean: preserving element order, defined by e.g. a timestamp in the JSONs, and grouping exactly consecutive inputs up to the largest amount that approaches the boundary), I would:
sort the data globally (Spark has a sortBy function, which does that): this is a full data shuffle, so it IS expensive. One of the keys is not to apply anything that messes with the partitioning between step 4 and step 5. As long as the "partition map" fits into the driver's memory, this is almost a practical solution, but a very costly one.
If it is OK for groups not to reach an optimal size, then the solution becomes much simpler (and it respects the ordering of the RDD, if you have set one): it is pretty much what you would write if there were no Spark at all, just an Iterator of JSON files.
Personally, I'd define a recursive accumulator function (nothing Spark related) like so (I guess you could write a shorter, more efficient version using takeWhile):
import scala.annotation.tailrec

/**
 * Recursively aggregate the contents of an iterator into a Seq[Seq[String]]
 * @param remainingJSONs     the remaining original JSON contents to be aggregated
 * @param currentAccSize     the size of the active accumulation
 * @param currentAcc         the current aggregation of JSON strings
 * @param resultAccumulation the already completed groups of JSON strings
 */
@tailrec
def acc(remainingJSONs: Iterator[String], currentAccSize: Int, currentAcc: Seq[String], resultAccumulation: Seq[Seq[String]]): Seq[Seq[String]] = {
// IF there is nothing more in the current partition
if (remainingJSONs.isEmpty) {
// And we were not in the process of accumulating
if (currentAccSize == 0)
// Then return what was accumulated before
resultAccumulation
else
// Return what was accumulated before, and what was in the process of being accumulated
resultAccumulation :+ currentAcc
} else {
// We still have JSON items to process
val itemToAggregate = remainingJSONs.next()
// Is this item too large for the current accumulation ?
if (currentAccSize + itemToAggregate.size > MAX_SIZE) {
// Finish the current aggregation, and proceed with a fresh one
acc(remainingJSONs, itemToAggregate.size, Seq(itemToAggregate), resultAccumulation :+ currentAcc)
} else {
// Accumulate the current item on top of the current aggregation
acc(remainingJSONs, currentAccSize + itemToAggregate.size, currentAcc :+ itemToAggregate, resultAccumulation)
}
}
}
Now you take this accumulating code and make it run on each partition of Spark's RDD:
val jsonRDD = ...
val groupedJSONs = jsonRDD.mapPartitions(aPartition => {
acc(aPartition, 0, Seq(), Seq()).iterator
})
This will turn your RDD[String] into an RDD[Seq[String]], where each Seq[String] is made of consecutive RDD elements (which may be predictable if the RDD has been sorted, and may not be otherwise), whose total length is below the threshold. What may be "sub-optimal" is that, at the end of each partition, there may lie a Seq[String] with just a few (possibly a single one) JSONs, while at the beginning of the following partition a full one was created.
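As a quick sanity check (a small sketch, not from the original answer), you can compute each group's total length and confirm it stays below MAX_SIZE, keeping in mind that a single JSON larger than the threshold still ends up alone in a group that exceeds it:

// Total length per group, using the same measure as the acc function above.
val groupSizes = groupedJSONs.map(group => group.map(_.length).sum)
groupSizes.collect().foreach(println)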