Hadoop MapReduce: values to reducer are in reverse order

I will be doing the following on a much bigger file. For now, I have an example input file with the following values.

1000,SMITH,JERRY
1001,JOHN,TIA
1002,TWAIN,MARK
1003,HARDY,DENNIS
1004,CHILD,JACK
1005,CHILD,NORTON
1006,DAVIS,JENNY
1007,DAVIS,KAREN
1008,MIKE,JOHN
1009,DENNIS,SHERIN

What I am doing now is running a MapReduce job to encrypt the last name of each record and write the output back. I am using the mapper's partition number as the key and the modified text as the value.

So the output from the mapper will be:

0   1000,Mj4oJyk=,JERRY
0   1001,KzwpPQ,TIA
0   1002,NSQgOi8,MARK
0   1003,KTIzNzg,DENNIS
0   1004,IjsoPyU,JACK
0   1005,IjsoPyU,NORTON
0   1006,JTI3OjI,JENNY
0   1007,JTI3OjI,KAREN
0   1008,LDoqNg,JOHN
0   1009,JTYvPSgg,SHERIN

I don't want any sorting to be done. I also use a reducer because, with a larger file, there will be multiple mappers, and without a reducer multiple output files would be written. So I use a single reducer to merge the values from all the mappers and write a single file. But the input values arrive at the reducer in reverse order compared with the order the mapper emitted them, like the following:

1009,JTYvPSgg,SHERIN
1008,LDoqNg==,JOHN
1007,JTI3OjI=,KAREN
1006,JTI3OjI=,JENNY
1005,IjsoPyU=,NORTON
1004,IjsoPyU=,JACK
1003,KTIzNzg=,DENNIS
1002,NSQgOi8=,MARK
1001,KzwpPQ==,TIA
1000,Mj4oJyk=,JERRY

Why is the order reversed, and how can I maintain the same order the mapper produced? Any suggestions would be helpful.

EDIT 1:

The driver code is:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJobName("encrypt");
job.setJarByClass(TestDriver.class);
job.setMapperClass(TestMap.class);
job.setNumReduceTasks(1);
job.setReducerClass(TestReduce.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

FileInputFormat.addInputPath(job, new Path(hdfsInputPath));
FileOutputFormat.setOutputPath(job, new Path(hdfsOutputPath));
System.exit(job.waitForCompletion(true) ? 0 : 1);

The mapper code is:

inputValues = value.toString().split(",");
stringBuilder = new StringBuilder();
TaskID taskId = context.getTaskAttemptID().getTaskID();
int partition = taskId.getId();

// the mask(inputValues) method encrypts the input values and writes them
// to stringBuilder in the appropriate format
mask(inputValues);
context.write(new IntWritable(partition), new Text(stringBuilder.toString()));
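
The mask() method itself is not shown in the question. As a rough illustration only, a masking helper consistent with the Base64-looking output might XOR the surname bytes with a fixed key and Base64-encode the result (the key byte, field layout, and helper body below are assumptions, not the asker's actual code):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch of mask(): obfuscate the surname (second CSV field) and
// rebuild the record as "id,maskedSurname,firstName" in stringBuilder.
private void mask(String[] fields) {
    byte[] surname = fields[1].getBytes(StandardCharsets.UTF_8);
    for (int i = 0; i < surname.length; i++) {
        surname[i] ^= 0x13; // assumed fixed key byte
    }
    stringBuilder.append(fields[0]).append(',')
                 .append(Base64.getEncoder().encodeToString(surname)).append(',')
                 .append(fields[2]);
}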

The reducer code is:

for (Text value : values) {
    context.write(new Text(value), null);
}

The base idea of MapReduce is that the order in which things are done is irrelevant. So you cannot (and do not need to) control the order in which:

  • the input records go through the mappers;
  • the keys and their associated values go through the reducers.

The only thing you can control is the order in which the values are placed in the iterator that is made available to the reducer.

For that you can use the Object key to maintain the order of the values. The LongWritable part (the key) is the position of the line in the file (not the line number, but the byte offset from the start of the file). You can use that part to keep track of which line came first.

Then your mapper code would change to:

protected void map(Object key, Text value, Mapper<Object, Text, LongWritable, Text>.Context context)
        throws IOException, InterruptedException {
    inputValues = value.toString().split(",");
    stringBuilder = new StringBuilder();
    // the mask(inputValues) method encrypts the input values and writes them
    // to stringBuilder in the appropriate format
    mask(inputValues);
    // use the byte offset of the input line as the key so a single reducer
    // receives the records sorted in file order
    context.write(new LongWritable(((LongWritable) key).get()), new Text(stringBuilder.toString()));
}

Note: you can change all IntWritable to LongWritable in your code, but be careful.
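
If you do switch the keys to LongWritable, the driver has to be updated to match the new map output key type; a minimal sketch of the affected lines (the rest of the driver stays as above):

job.setMapOutputKeyClass(LongWritable.class); // was IntWritable.class
job.setMapOutputValueClass(Text.class);       // unchanged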

inputValues = value.toString().split(",");
stringBuilder = new StringBuilder();
TaskID taskId = context.getTaskAttemptID().getTaskID();
// preserve the numeric ID so the framework sorts records by it
IntWritable idNumber = new IntWritable(Integer.parseInt(inputValues[0]));

// the mask(inputValues) method encrypts the input values and writes them
// to stringBuilder in the appropriate format
mask(inputValues);
context.write(idNumber, new Text(stringBuilder.toString()));

I made some assumptions because you did not post the full code of the mapper. I assumed that inputValues is a String array, given the toString() call. The first element of the array should be the numeric ID from your input, but it is currently a String. You must convert that number back to an IntWritable to match the IntWritable, Text pair your mapper emits. The Hadoop framework sorts by key, and with a key of type IntWritable it will sort in ascending order. The code you provided uses the task ID, and from reading the API ( https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/TaskAttemptID.html#getTaskID() ) it was unclear whether that would order your values as you desired. To control the order of the output I would recommend using the first value of your string array, converted to an IntWritable. I don't know whether this violates your intent to mask inputValues.
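
For completeness, the reducer can stay essentially the same as yours under this approach; a minimal sketch (assuming the driver's output value class is changed to NullWritable, which is my assumption, not your code):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TestReduce extends Reducer<IntWritable, Text, Text, NullWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // With the record ID as the key and a single reduce task, records
        // arrive here, and are written out, in ascending ID order.
        for (Text value : values) {
            context.write(value, NullWritable.get());
        }
    }
}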

EDIT

To follow up on your comment: you can simply multiply the partition by -1, which will cause the Hadoop framework to reverse the order, since IntWritable keys sort ascending and negating them inverts that ordering.

int partition = -1 * taskId.getId();
