Spark flatMap/reduce: How to scale and avoid OutOfMemory?

Question

I am migrating some MapReduce code to Spark and am having problems constructing the Iterable to return from a function. In the MR code, I had a reduce function that grouped by key and then iterated over the values, writing each one out (via multipleOutputs, but that's unimportant), roughly like this (simplified):

reduce(Key key, Iterable<Text> values) {
    // ... some code
    for (Text xml: values) {
        multipleOutputs.write(key, xml, directory);
    }
}

However, in Spark I have translated a map and this reduce into a sequence of: mapToPair -> groupByKey -> flatMap as recommended... in some book.

mapToPair basically adds a Key via functionMap, which creates a Key for each record based on some of the record's values. Sometimes a key may have very high cardinality (that is, very many records).

JavaPairRDD<Key, String> rddPaired = inputRDD.mapToPair(new PairFunction<String, Key, String>() { 
    public Tuple2<Key, String> call(String value) {
        //... 
        return functionMap.call(value);
    }
});

groupByKey() is then applied to rddPaired to get the RDD that feeds the flatMap function:

JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.groupByKey();

Once grouped, a flatMap call does the reduce. Here, operation is a transformation:

public Iterable<String> call (Tuple2<Key, Iterable<String>> keyValue) {
    // some code...
    List<String> out = new ArrayList<String>();
    if (someConditionOnKey) { 
        // do a logic
        Grouper grouper = new Grouper();
        for (String xml : keyValue._2()) {
            // group in a separate class
            grouper.add(xml);
        }
        // operation is now performed on the whole group
        out.add(operation(grouper));
    } else {
        for (String xml : keyValue._2()) {
            out.add(operation(xml));
        }
    }
    // return after both branches; the original only returned from the "else"
    return out;
}

It works fine... with keys that don't have too many records. It fails with an OutOfMemory error when a key with a lot of values enters the "else" branch of the reduce.

Note: I have included the "if" part to explain the logic I want to produce, but the failure happens when entering the "else"... because when data enters the "else", it normally means there will be many more values for that key, due to the nature of the data.

It is clear that keeping all of the grouped values in the "out" list won't scale if a key has millions of records, because they are all held in memory. I have reached the point where the OOM happens (yes, it's when performing the "operation" above, which asks for memory and gets none. It's not a very memory-expensive operation, though).

Is there any way to avoid this in order to scale? Either by replicating the behaviour with some other operations that reach the same output in a more scalable way, or by being able to hand the values to Spark for merging (just as I used to do with MR)...

Answer 1

It's inefficient to check the condition inside the flatMap operation. You should check the condition outside to create 2 distinct RDDs and deal with them separately.

rddPaired.cache();

// groupFilterFunc will filter which items need grouping
JavaPairRDD<Key, Iterable<String>> rddGrouped = rddPaired.filter(groupFilterFunc).groupByKey();
// processGroupedValuesFunction should call `operation` on group of all values with the same key and return the result
rddGrouped.mapValues(processGroupedValuesFunction);

// nogroupFilterFunc will filter which items don't need grouping
// no groupByKey here, so the values are still plain Strings
JavaPairRDD<Key, String> rddNoGrouped = rddPaired.filter(nogroupFilterFunc);
// processNoGroupedValuesFunction2 should call `operation` on a single value and return the result
rddNoGrouped.mapValues(processNoGroupedValuesFunction2);
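
If the grouped flatMap from the question has to stay, one possible mitigation for the "out" list is to return an Iterable that applies operation only as each value is consumed, instead of filling an ArrayList first. The following is a minimal sketch in plain Java 8, not Spark code: LazyMappedIterable is a hypothetical helper name, and toUpperCase() in main merely stands in for the question's operation(xml). Note that groupByKey still holds all values of one key in memory, so this only avoids the second, transformed copy; dropping the grouping for the "else" records, as above, is what removes the first one.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of Spark): wraps an Iterable and applies
 * fn to each element only when that element is requested.
 */
class LazyMappedIterable<A, B> implements Iterable<B> {
    private final Iterable<A> source;
    private final Function<A, B> fn;

    LazyMappedIterable(Iterable<A> source, Function<A, B> fn) {
        this.source = source;
        this.fn = fn;
    }

    @Override
    public Iterator<B> iterator() {
        final Iterator<A> it = source.iterator();
        return new Iterator<B>() {
            @Override
            public boolean hasNext() {
                return it.hasNext();
            }

            @Override
            public B next() {
                // transform one element at a time; nothing is buffered here
                return fn.apply(it.next());
            }
        };
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("<a/>", "<b/>");
        // toUpperCase() is a stand-in for operation(xml) from the question
        Iterable<String> out = new LazyMappedIterable<>(values, xml -> xml.toUpperCase());
        for (String s : out) {
            System.out.println(s);
        }
    }
}
```

Under these assumptions, the "else" branch of the question's call method could return new LazyMappedIterable<>(keyValue._2(), xml -> operation(xml)) instead of filling out. In Spark 2.0 and later, FlatMapFunction.call returns an Iterator rather than an Iterable, so there the method would return the wrapper's iterator() instead.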

solution1
1 ACCEPTED 2016-07-12 10:46:30