
How can I iterate through a Hadoop reduce values Iterable more than once without caching in Hadoop 1.0.3?

I have a problem where I basically want to do something like this:

    public void reduce(Text key, Iterable<Text> iterValues, Context context) {

        for (Text val : iterValues) {
            // do something
        }

        iterValues.reset(); // no such method on Iterable -- this is what I'm after
        for (Text val : iterValues) {
            // do something else
        }
    }

I know it's best to avoid situations like this, or to simply buffer the objects in memory, but I have a problem where there could be too many values to keep in memory, and breaking the job up into more reduce steps would make it structurally much more complicated.

It seems I'm not alone in wanting this functionality; in fact, it looks like it was implemented a while ago: https://issues.apache.org/jira/browse/HADOOP-5266

The MarkableIterator class seems to be exactly what I'm looking for: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/MarkableIterator.html
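For reference, the usage pattern its Javadoc suggests looks roughly like this (a sketch against the 2.x API; none of this compiles on 1.0.3):

    import org.apache.hadoop.mapreduce.MarkableIterator;

    public void reduce(Text key, Iterable<Text> iterValues, Context context)
            throws IOException, InterruptedException {
        // Wrap the values iterator; mark() remembers the current position
        // and reset() rewinds back to it for another pass.
        MarkableIterator<Text> mitr = new MarkableIterator<Text>(iterValues.iterator());
        mitr.mark();
        while (mitr.hasNext()) {
            Text val = mitr.next();
            // do something
        }
        mitr.reset();
        while (mitr.hasNext()) {
            Text val = mitr.next();
            // do something else
        }
    }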

However, it seems to be available only in Hadoop 2.0.3-alpha. I'm looking to run this on EMR, which only supports 1.0.3 (what I'm currently using) or 0.20.205. I've been trying various things, but I haven't found anything in 1.0.3 that gives me similar functionality. The closest I've come is using a StreamBackedIterator, which still accumulates objects in memory, but appears to be more memory-efficient than an ArrayList.
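For what it's worth, the StreamBackedIterator approach looks roughly like this, if I'm reading the join-package API right (a sketch; StreamBackedIterator lives in org.apache.hadoop.mapred.join and buffers the serialized bytes of each value, replaying copies on demand):

    import org.apache.hadoop.mapred.join.StreamBackedIterator;

    public void reduce(Text key, Iterable<Text> iterValues, Context context)
            throws IOException, InterruptedException {
        StreamBackedIterator<Text> buffer = new StreamBackedIterator<Text>();
        for (Text val : iterValues) {
            buffer.add(val);        // serializes a copy into an in-memory stream
        }

        Text val = new Text();
        buffer.reset();
        while (buffer.next(val)) {  // deserializes the next value into val
            // do something
        }

        buffer.reset();             // rewind and replay as many times as needed
        while (buffer.next(val)) {
            // do something else
        }
        buffer.close();
    }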

Is anyone aware of a way to do this in Hadoop 1.0.3?

This is a bit of a hack, but you could have your Mapper emit every value twice, with a flag set in one copy and not the other. Then order the values first on that flag, then on whatever natural ordering you want. You'll then need some custom logic to stop the first loop once you hit the second set of values.
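A minimal sketch of the mapper side, with hypothetical key derivation and with the secondary-sort plumbing (composite key or custom comparator so the flag-0 copies sort first) left out:

    // Hypothetical mapper for the double-emission idea: each value goes out
    // twice, tagged with a flag prefix. A secondary sort (not shown) must
    // guarantee that within a key all "0"-tagged copies precede the "1"s.
    public static class DoubleEmitMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            Text outKey = new Text("someKey"); // however you derive your key
            context.write(outKey, new Text("0\t" + line)); // copy for the first loop
            context.write(outKey, new Text("1\t" + line)); // copy for the second loop
        }
    }

In the reducer you'd strip the flag off each value and switch from the first pass to the second when it flips from 0 to 1.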

Other than that, no, I don't see an easy way of doing this without storing the values in memory yourself. The main problem is that the iterator doesn't actually return new objects; it returns the same object, mutated between calls to next(). Behind the scenes, Hadoop may not even be caching the whole set of values, so resetting the iterator would require re-scanning a file (which I'm guessing is what they're doing in the new version).
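That object reuse is also why naively buffering the values fails; you have to copy each one, e.g.:

    import java.util.ArrayList;
    import java.util.List;

    // Hadoop reuses a single Text instance across calls to next(), so storing
    // the reference itself leaves you with N pointers to the last value.
    List<Text> buffered = new ArrayList<Text>();
    for (Text val : iterValues) {
        buffered.add(new Text(val)); // defensive copy; buffered.add(val) is wrong
    }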
