
How to reset Iterator on MapReduce Function in Apache Spark

I'm a newbie with Apache Spark. I want to know how to reset the pointer of an Iterator inside a MapReduce function in Apache Spark, so I wrote:

Iterator<Tuple2<String,Set<String>>> iter = arg0;    

but it isn't working. The following is the class implementing the MapReduce function in Java:

import java.io.Serializable;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFlatMapFunction;

import scala.Tuple2;

class CountCandidates implements Serializable,
    PairFlatMapFunction<Iterator<Tuple2<String,Set<String>>>, Set<String>, Integer>,
    Function2<Integer, Integer, Integer>{

    private List<Set<String>> currentCandidatesSet;
    public CountCandidates(final List<Set<String>> currentCandidatesSet) {
        this.currentCandidatesSet = currentCandidatesSet;
    }

    @Override
    public Iterable<Tuple2<Set<String>, Integer>> call(
            Iterator<Tuple2<String, Set<String>>> arg0)
            throws Exception {
        List<Tuple2<Set<String>,Integer>> resultList = 
                new LinkedList<Tuple2<Set<String>,Integer>>();

        for(Set<String> currCandidates : currentCandidatesSet){
            // NOTE: this only copies the reference; it does not reset the
            // underlying Iterator, which is already exhausted after the
            // first pass of the outer loop
            Iterator<Tuple2<String,Set<String>>> iter = arg0;
            while(iter.hasNext()){
                Set<String> events = iter.next()._2;
                if(events.containsAll(currCandidates)){
                    Tuple2<Set<String>, Integer> t = 
                            new Tuple2<Set<String>, Integer>(currCandidates,1);
                    resultList.add(t);
                }
            }
        }

        return resultList;
    }

    @Override
    public Integer call(Integer arg0, Integer arg1) throws Exception {
        return arg0+arg1;
    }
}

If the Iterator cannot be reset inside the function, how can I iterate over the parameter arg0 several times? I have already tried a different approach, shown in the following code, but it does not work either; resultList seems to end up with more data than I expected.

        // single pass over arg0, checking each record against every candidate set
        while(arg0.hasNext()){
            Set<String> events = arg0.next()._2;
            for(Set<String> currentCandidates : currentCandidatesSet){
                if(events.containsAll(currentCandidates)){
                    Tuple2<Set<String>, Integer> t = 
                            new Tuple2<Set<String>, Integer>(currentCandidates,1);
                    resultList.add(t);
                }
            }
        }

How can I solve this?

Thanks in advance for your answer, and sorry for my poor English. If you don't understand my question, please leave a comment.

An Iterator can't be 'reset', even in plain Java or Scala; that's the nature of an Iterator. An Iterable, by contrast, is something that can hand out a fresh Iterator as many times as you like. If multiple passes are what you really need, your code has to be rewritten to work with an Iterable.
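Concretely, one common workaround is to drain the Iterator into a List exactly once and then iterate that List as many times as needed, since a List is an Iterable. Below is a minimal sketch of the call method rewritten this way (my own illustration, not code from the question; it assumes it lives in the same CountCandidates class and reuses its currentCandidatesSet field). The obvious trade-off is that the whole partition is held in memory at once:

    @Override
    public Iterable<Tuple2<Set<String>, Integer>> call(
            Iterator<Tuple2<String, Set<String>>> arg0) throws Exception {
        // Drain the Iterator exactly once into an in-memory List.
        // Caveat: the entire partition is kept in memory at the same time.
        List<Tuple2<String, Set<String>>> records =
                new LinkedList<Tuple2<String, Set<String>>>();
        while (arg0.hasNext()) {
            records.add(arg0.next());
        }

        List<Tuple2<Set<String>, Integer>> resultList =
                new LinkedList<Tuple2<Set<String>, Integer>>();
        for (Set<String> currCandidates : currentCandidatesSet) {
            // A List is an Iterable: each outer iteration gets a fresh
            // Iterator over the same records via the for-each loop.
            for (Tuple2<String, Set<String>> record : records) {
                if (record._2.containsAll(currCandidates)) {
                    resultList.add(
                            new Tuple2<Set<String>, Integer>(currCandidates, 1));
                }
            }
        }
        return resultList;
    }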

The Hadoop iterator could theoretically be reset to the beginning if it were cloneable. Resetting to the beginning would be acceptable in a MapReduce framework, since you would still read the file sequentially from the start and keep good overall throughput. Resetting the iterator to an arbitrary point, however, would run counter to the MapReduce mindset, because it would likely require random access into a file.

There is a ticket in Hadoop's JIRA explaining why they chose not to make the iterator cloneable, although it does indicate that cloning would be possible, since the values would not have to be stored in memory.
