
Spark flatMap/reduce: How to scale and avoid OutOfMemory?

I am migrating some map-reduce code into Spark, and I am having problems constructing an Iterable to return from the function. In the MR code, I had a reduce function that grouped by key, and then (using multipleOutputs) would iterate the values and use write (to multiple outputs, but that's unimportant) in some code like this (simplified):

reduce(Key key, Iterable<Text> values) {
    // ... some code
    for (Text xml: values) {
        multipleOutputs.write(key, xml, directory);
    }
}

However, in Spark I have translated that map and this reduce into a sequence of mapToPair -> groupByKey -> flatMap, as recommended... in some book.

mapToPair basically adds a Key via functionMap, which creates a Key for a record based on some of its values. Sometimes a key may have very high cardinality.

JavaPairRDD<Key, String> rddPaired = inputRDD.mapToPair(new PairFunction<String, Key, String>() { 
    public Tuple2<Key, String> call(String value) {
        //... 
        return functionMap.call(value);
    }
});
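
functionMap itself is not shown in the question; purely as an illustration (the Key class, the record format and the fields used here are assumptions), it might look something like this:

// Hypothetical sketch of functionMap: build a Key from a couple of the record's fields.
// The delimiter and the Key constructor are assumptions, not part of the original code.
PairFunction<String, Key, String> functionMap = new PairFunction<String, Key, String>() {
    public Tuple2<Key, String> call(String record) {
        String[] fields = record.split("\t");       // assumed record format
        Key key = new Key(fields[0], fields[1]);    // assumed: key derived from some field values
        return new Tuple2<Key, String>(key, record);
    }
};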

rddPaired then has RDD.groupByKey() applied to it, to get the RDD that feeds the flatMap function:

JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.groupByKey();

Once grouped, a flatMap call does the reduce. Here, operation is a transformation:

public Iterable<String> call (Tuple2<Key, Iterable<String>> keyValue) {
    // some code...
    List<String> out = new ArrayList<String>();
    if (someConditionOnKey) { 
        // do a logic
        Grouper grouper = new Grouper();
        for (String xml : keyValue._2()) {
            // group in a separate class
            grouper.add(xml);
        }
        // operation is now performed on the whole group
        out.add(operation(grouper));
    } else {
        // operation is applied to each value individually
        for (String xml : keyValue._2()) {
            out.add(operation(xml));
        }
    }
    return out;
}

It works fine... with keys that don't have too many records. It actually breaks with an OutOfMemory error when a key with a lot of values enters the "else" branch of the reduce.

Note: I have included the "if" part to explain the logic I want to produce, but the failure happens when entering the "else"... because when data enters the "else", it normally means there will be many more values for that key, due to the nature of the data.

It is clear that, having to keep all of the grouped values in the "out" list, this won't scale if a key has millions of records, because they are all kept in memory. I have reached the point where the OOM happens (yes, it's when performing the "operation" above, which asks for memory and none is available; it's not a very expensive memory operation though).

Is there any way to avoid this in order to scale? Either by replicating the behaviour with some other directives to reach the same output in a more scalable way, or by being able to hand Spark the values for merging (just as I used to do with MR)...

It's inefficient to do the condition check inside the flatMap operation. You should check the condition outside, to create 2 distinct RDDs and deal with them separately.

rddPaired.cache();

// groupFilterFunc will filter which items need grouping
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.filter(groupFilterFunc).groupByKey();
// processGroupedValuesFunction should call `operation` on group of all values with the same key and return the result
rddGrouped.mapValues(processGroupedValuesFunction);

// nogroupFilterFunc will filter which items don't need grouping
JavaPairRDD<Key, String> rddNoGrouped = rddPaired.filter(nogroupFilterFunc);
// processNoGroupedValuesFunction2 should call `operation` on a single value and return the result
rddNoGrouped.mapValues(processNoGroupedValuesFunction2);
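
If the grouped "operation" can be computed incrementally (an assumption about the question's Grouper), a further option is to replace groupByKey on the grouped path with aggregateByKey, so that the values of a hot key are folded into an accumulator as they arrive instead of being materialized as one Iterable in memory. A minimal sketch (using Java 8 lambdas), assuming Grouper is Serializable and can merge two partial groups:

// Sketch only: assumes Grouper has (or can be given) a merge() method for combining
// partial groups built on different partitions; only the running Grouper per key is kept in memory.
JavaPairRDD<Key, Grouper> rddAggregated = rddPaired
    .filter(groupFilterFunc)
    .aggregateByKey(
        new Grouper(),                                            // per-key zero value
        (grouper, xml) -> { grouper.add(xml); return grouper; },  // fold one value into the accumulator
        (g1, g2) -> { g1.merge(g2); return g1; });                // combine partial accumulators
JavaPairRDD<Key, String> rddGroupedResult = rddAggregated.mapValues(grouper -> operation(grouper));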
