
Spark flatMap/reduce: How to scale and avoid OutOfMemory?

I am migrating some map-reduce code into Spark, and I am having problems constructing an Iterable to return from the function. In the MR code, I had a reduce function that grouped by key, and then (using multipleOutputs) would iterate the values and use write (to multiple outputs, but that's unimportant) in some code like this (simplified):

reduce(Key key, Iterable<Text> values) {
    // ... some code
    for (Text xml: values) {
        multipleOutputs.write(key, xml, directory);
    }
}

However, in Spark I have translated that map and this reduce into a sequence of mapToPair -> groupByKey -> flatMap, as recommended... in some book.

mapToPair basically adds a Key via functionMap, which creates a Key for a record based on some of its values. Sometimes a key may have very high cardinality.

JavaPairRDD<Key, String> rddPaired = inputRDD.mapToPair(new PairFunction<String, Key, String>() { 
    public Tuple2<Key, String> call(String value) {
        //... 
        return functionMap.call(value);
    }
});
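
functionMap itself is not shown in the question; purely as an illustration (the Key class, the record format and the fields used here are assumptions), it might look something like this:

// Hypothetical sketch of functionMap: build a Key from a couple of the record's fields.
// The delimiter and the Key constructor are assumptions, not part of the original code.
PairFunction<String, Key, String> functionMap = new PairFunction<String, Key, String>() {
    public Tuple2<Key, String> call(String record) {
        String[] fields = record.split("\t");       // assumed record format
        Key key = new Key(fields[0], fields[1]);    // assumed: key derived from some field values
        return new Tuple2<Key, String>(key, record);
    }
};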

rddPaired then has RDD.groupByKey() applied to it, to get the RDD that feeds the flatMap function:

JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.groupByKey();

Once grouped, a flatMap call does the reduce. Here, operation is a transformation:

public Iterable<String> call (Tuple2<Key, Iterable<String>> keyValue) {
    // some code...
    List<String> out = new ArrayList<String>();
    if (someConditionOnKey) { 
        // do a logic
        Grouper grouper = new Grouper();
        for (String xml : keyValue._2()) {
            // group in a separate class
            grouper.add(xml);
        }
        // operation is now performed on the whole group
        out.add(operation(grouper));
    } else {
        // operation is applied to each value individually
        for (String xml : keyValue._2()) {
            out.add(operation(xml));
        }
    }
    return out;
}

It works fine... with keys that don't have too many records. It actually breaks with an OutOfMemory error when a key with a lot of values enters the "else" branch of the reduce.

Note: I have included the "if" part to explain the logic I want to produce, but the failure happens when entering the "else"... because when data enters the "else", it normally means there will be many more values for that key, due to the nature of the data.

It is clear that, having to keep all of the grouped values in the "out" list, this won't scale if a key has millions of records, because they are all kept in memory. I have reached the point where the OOM happens (yes, it's when performing the "operation" above, which asks for memory and none is available; it's not a very expensive memory operation though).

Is there any way to avoid this in order to scale? Either by replicating the behaviour with some other directives to reach the same output in a more scalable way, or by being able to hand Spark the values for merging (just as I used to do with MR)...

It's inefficient to do the condition check inside the flatMap operation. You should check the condition outside, to create 2 distinct RDDs and deal with them separately.

rddPaired.cache();

// groupFilterFunc will filter which items need grouping
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.filter(groupFilterFunc).groupByKey();
// processGroupedValuesFunction should call `operation` on group of all values with the same key and return the result
rddGrouped.mapValues(processGroupedValuesFunction);

// nogroupFilterFunc will filter which items don't need grouping
JavaPairRDD<Key, String> rddNoGrouped = rddPaired.filter(nogroupFilterFunc);
// processNoGroupedValuesFunction2 should call `operation` on a single value and return the result
rddNoGrouped.mapValues(processNoGroupedValuesFunction2);
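
If the grouped "operation" can be computed incrementally (an assumption about the question's Grouper), a further option is to replace groupByKey on the grouped path with aggregateByKey, so that the values of a hot key are folded into an accumulator as they arrive instead of being materialized as one Iterable in memory. A minimal sketch (using Java 8 lambdas), assuming Grouper is Serializable and can merge two partial groups:

// Sketch only: assumes Grouper has (or can be given) a merge() method for combining
// partial groups built on different partitions; only the running Grouper per key is kept in memory.
JavaPairRDD<Key, Grouper> rddAggregated = rddPaired
    .filter(groupFilterFunc)
    .aggregateByKey(
        new Grouper(),                                            // per-key zero value
        (grouper, xml) -> { grouper.add(xml); return grouper; },  // fold one value into the accumulator
        (g1, g2) -> { g1.merge(g2); return g1; });                // combine partial accumulators
JavaPairRDD<Key, String> rddGroupedResult = rddAggregated.mapValues(grouper -> operation(grouper));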
