
Hadoop Multi-node programming with Java

I am new to Hadoop. I want to use 2, 4, or 6 nodes for each run to split the dataset that is sent to the mappers, but the code I have written does not work properly. It works for 2 nodes, but as the number of nodes increases, some output data is lost from the output file. Could you please help me? Thank you.

Here is the code:

public static void main(String[] args) throws Exception {

    System.out.println("MapReduce Started at:" + System.currentTimeMillis());
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    int numOfNodes = 2;

    Job job = new Job(conf, "calculateSAAM");
    job.setJarByClass(calculateSAAM.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path("/home/helsene/wordcount/input"));
    String outputFile = "/home/helsene/wordcount/output/";

    // Cap the split size so the input is divided across numOfNodes mappers.
    // (Note: the length here is read from the output path, not the input path.)
    long dataLength = fs.getContentSummary(new Path(outputFile)).getLength();
    FileInputFormat.setMaxInputSplitSize(job, (dataLength / numOfNodes));

    job.setNumReduceTasks(numOfNodes / 2);

    // Remove any previous output directory before the job runs.
    Path outPath = new Path(outputFile);
    fs.delete(outPath, true);
    FileOutputFormat.setOutputPath(job, new Path(outputFile));

    job.waitForCompletion(true);
    System.out.println("MapReduce ends at:" + System.currentTimeMillis());
}

Each reducer produces one output file, named part-xxxxx by default (part-00000 for the first reducer, part-00001 for the second, and so on).

With your code, when you have more than 3 nodes you will have more than one reducer, so the output data will be split across multiple part files. This means that some word counts will end up in the first file (part-00000), some in the second file (part-00001), and so on. You can later merge these parts with the getmerge command, like:

hadoop dfs -getmerge /HADOOP/OUTPUT/PATH /local/path/

to get a single file in the specified local path containing the merged results of all the partial files. This file will have the same results as the one you get with two nodes, where 2/2 = 1 reducer produces a single output file.
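If you prefer to do the merge from Java rather than from the shell, FileUtil.copyMerge (available in Hadoop 1.x/2.x; it was removed in 3.x) concatenates all the part files of a directory into one target file. A minimal sketch, where the output directory and the merged file path are placeholders you would adjust to your own setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder paths: the job's output directory and the single merged file to create.
        Path partsDir = new Path("/home/helsene/wordcount/output/");
        Path merged = new Path("/home/helsene/wordcount/output-merged.txt");

        // Concatenates every file under partsDir (part-00000, part-00001, ...) into 'merged'.
        // 'false' keeps the source files; the last argument is a separator appended after each part.
        FileUtil.copyMerge(fs, partsDir, fs, merged, false, conf, "");
    }
}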

By the way, setting the number of reducers to numOfNodes/2 is probably not the best option. See this post for more details.
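For reference, the usual rule of thumb from the Hadoop documentation is to set the number of reduces to roughly 0.95 or 1.75 times (number of nodes × reduce slots/containers available per node), rather than deriving it from the node count alone. A hedged sketch of that heuristic; reducerSlotsPerNode is an assumed value you would look up in your cluster configuration, and suggestedReducers is just a hypothetical helper name:

// Hypothetical helper implementing the 0.95 rule of thumb from the Hadoop docs.
// reducerSlotsPerNode is an assumption: how many reduce tasks one node can run at once.
static int suggestedReducers(int numOfNodes, int reducerSlotsPerNode) {
    // 0.95 * total slots: all reduces can launch immediately and finish in one wave.
    // Using 1.75 instead trades this for better load balancing across two waves.
    return (int) (0.95 * numOfNodes * reducerSlotsPerNode);
}

In main(), the line job.setNumReduceTasks(numOfNodes / 2); would then become job.setNumReduceTasks(suggestedReducers(numOfNodes, reducerSlotsPerNode));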
