Running a local hadoop map-reduce does not partition data as expected

I have a map-reduce program that counts the occurrences of bigrams from Google ngrams for each decade.
My partitioner is:

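public static class PartitionerClass extends Partitioner<Bigram, IntWritable> {
    @Override
    public int getPartition(Bigram key, IntWritable value, int numPartitions) {
        String combined = key.getFirst().toString() + key.getSecond().toString() + key.getDecade().toString();
        // Clear the sign bit before taking the remainder: hashCode() may be negative,
        // and a partition index must lie in [0, numPartitions).
        return (combined.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}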

I added a breakpoint, but the program never reaches that piece of code.
My main:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "first join");   // new Job(conf, name) is deprecated
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setPartitionerClass(PartitionerClass.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));    // SHOULD BE DECIDED
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapOutputKeyClass(Bigram.class);
job.setMapOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);

The code does not run as expected: some data is processed correctly and some isn't.
I really don't know how to debug this.
Any ideas?

The partitioner decides which key goes to which partition, given a number of partitions that is set elsewhere. Its job is not to set the number of partitions, but their contents. Each reduce task then handles one partition, so in the end, number of partitions = number of reduce tasks = number of output files (if using the default settings and not MultipleOutputs). By default a job has a single reduce task, so there is only one partition and nothing for the partitioner to decide, which is likely why your breakpoint is never hit.
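To make the relationship concrete, here is a minimal sketch in plain Java (no Hadoop dependency) of what partitioning amounts to. The class name `PartitionSketch`, the helper names, and the sample keys are made up for illustration; the index formula is the one Hadoop's default `HashPartitioner` uses. Each bucket stands in for the input of one reduce task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionSketch {
    // Clear the sign bit, then take the remainder, so the index is in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Distributes keys over numPartitions buckets; bucket i plays the role of reduce task i.
    static List<List<String>> assign(List<String> keys, int numPartitions) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partitionFor(key, numPartitions)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Hypothetical "first second decade" keys, only for illustration.
        List<String> keys = Arrays.asList(
                "new york 1990", "new york 2000", "hot dog 1990", "ice cream 1980");
        for (int n : new int[] {1, 3}) {
            System.out.println(n + " partition(s): " + assign(keys, n));
        }
    }
}
```

With one partition every key lands in bucket 0 regardless of its hash, so the partitioning logic only matters once there is more than one reduce task.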

To set the number of partitions, use:

job.setNumReduceTasks(n);

where n is the number of reduce tasks (and therefore partitions) that you want.
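One pitfall tends to surface as soon as n is greater than 1: a partitioner that returns `hashCode() % numPartitions` can produce a negative index, because Java's `%` keeps the sign of the left operand and `hashCode()` may be negative. Hadoop rejects an index outside [0, numPartitions) with an "Illegal partition" error. The sketch below, with hypothetical class and method names, contrasts the naive formula with the sign-bit mask used by Hadoop's default `HashPartitioner`.

```java
public class PartitionIndexSketch {
    // Naive formula: the result is negative whenever hash is negative
    // and not a multiple of numPartitions.
    static int naiveIndex(int hash, int numPartitions) {
        return hash % numPartitions;
    }

    // Masked formula: clearing the sign bit makes the left operand non-negative,
    // so the result always lies in [0, numPartitions).
    static int safeIndex(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        int hash = -7; // stands in for a negative hashCode()
        System.out.println("naive: " + naiveIndex(hash, numPartitions)); // -7 % 3 is -1 in Java
        System.out.println("safe:  " + safeIndex(hash, numPartitions));
    }
}
```

`Math.floorMod(hash, numPartitions)` would be another way to get a non-negative index; the mask is shown here only because it matches what the default partitioner does.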

For guidance on how to choose this number (rules of thumb, nothing strict), you can read this post.
