Hadoop MapReduce: values to reducer are in reverse order
I will be doing the following in a much bigger file. For now, I have an example input file with the following values:
1000,SMITH,JERRY
1001,JOHN,TIA
1002,TWAIN,MARK
1003,HARDY,DENNIS
1004,CHILD,JACK
1005,CHILD,NORTON
1006,DAVIS,JENNY
1007,DAVIS,KAREN
1008,MIKE,JOHN
1009,DENNIS,SHERIN
What I am doing now is running a MapReduce job to encrypt the last name of each record and write the output back, using the mapper partition number as the key and the modified text as the value. So the output from the mapper will be:
0 1000,Mj4oJyk=,JERRY
0 1001,KzwpPQ,TIA
0 1002,NSQgOi8,MARK
0 1003,KTIzNzg,DENNIS
0 1004,IjsoPyU,JACK
0 1005,IjsoPyU,NORTON
0 1006,JTI3OjI,JENNY
0 1007,JTI3OjI,KAREN
0 1008,LDoqNg,JOHN
0 1009,JTYvPSgg,SHERIN
I don't want any sorting to be done. I also use a reducer because, with a larger file, there will be multiple mappers, and without a reducer multiple output files would be written. So I use a single reducer to merge the values from all mappers and write them to a single file. But now the input values arrive at the reducer in reverse order, not in the order emitted by the mapper. It is like the following:
1009,JTYvPSgg,SHERIN
1008,LDoqNg==,JOHN
1007,JTI3OjI=,KAREN
1006,JTI3OjI=,JENNY
1005,IjsoPyU=,NORTON
1004,IjsoPyU=,JACK
1003,KTIzNzg=,DENNIS
1002,NSQgOi8=,MARK
1001,KzwpPQ==,TIA
1000,Mj4oJyk=,JERRY
Why is it reversing the order? And how can I maintain the same order as the mapper output? Any suggestions will be helpful.
EDIT 1:
The driver code is:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJobName("encrypt");
job.setJarByClass(TestDriver.class);
job.setMapperClass(TestMap.class);
job.setNumReduceTasks(1);
job.setReducerClass(TestReduce.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(hdfsInputPath));
FileOutputFormat.setOutputPath(job, new Path(hdfsOutputPath));
System.exit(job.waitForCompletion(true) ? 0 : 1);
The mapper code is:
inputValues = value.toString().split(",");
stringBuilder = new StringBuilder();
TaskID taskId = context.getTaskAttemptID().getTaskID();
int partition = taskId.getId();
// the mask(inputValues) method is called to encrypt the input values and write to stringBuilder in the appropriate format
mask(inputValues);
context.write(new IntWritable(partition), new Text(stringBuilder.toString()));
The reducer code is:
for(Text value : values) {
context.write(new Text(value), null);
}
The base idea of MapReduce is that the order in which things are done is irrelevant. So you cannot (and do not need to) control the order in which the input records are processed.
The only thing you can control is the order in which the values are placed in the iterator that is made available in the reducer.
For that you can use the Object key to maintain the order of values. The LongWritable part (the key) is the position of the line in the file (not the line number, but the byte offset from the start of the file). You can use that part to keep track of which line was first.
Then your mapper code will be changed to:
protected void map(Object key, Text value, Mapper<Object, Text, LongWritable, Text>.Context context)
        throws IOException, InterruptedException {
    inputValues = value.toString().split(",");
    stringBuilder = new StringBuilder();
    // the mask(inputValues) method is called to encrypt the input values and write to stringBuilder in the appropriate format
    mask(inputValues);
    context.write(new LongWritable(((LongWritable) key).get()), new Text(stringBuilder.toString()));
}
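To see why keying by the byte offset preserves order: the shuffle sorts map output keys in ascending order before the reducer iterates over them. A minimal plain-Java sketch (no Hadoop; the offsets and masked lines are made-up sample data) simulating that ascending key sort with a TreeMap:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OffsetOrderDemo {
    // Sorts (offset, line) pairs by offset ascending, which is what the
    // MapReduce shuffle does for LongWritable keys.
    static List<String> sortByOffset(long[] offsets, String[] lines) {
        Map<Long, String> shuffled = new TreeMap<>();
        for (int i = 0; i < offsets.length; i++) {
            shuffled.put(offsets[i], lines[i]);
        }
        // TreeMap.values() iterates in ascending key order.
        return new ArrayList<>(shuffled.values());
    }

    public static void main(String[] args) {
        // Pairs deliberately listed out of order, mimicking shuffle arrival order.
        long[] offsets = {34, 0, 17};
        String[] lines = {"1002,NSQgOi8=,MARK", "1000,Mj4oJyk=,JERRY", "1001,KzwpPQ==,TIA"};
        for (String line : sortByOffset(offsets, lines)) {
            System.out.println(line); // prints the records in original file order
        }
    }
}
```

Because each line's byte offset is unique and increases down the file, sorting by it in the shuffle hands the single reducer the records in their original order.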
Note: you can change all IntWritable to LongWritable in your code, but be careful.
inputValues = value.toString().split(",");
stringBuilder = new StringBuilder();
// preserve the number value for sorting
IntWritable idNumber = new IntWritable(Integer.parseInt(inputValues[0]));
// the mask(inputValues) method is called to encrypt the input values and write to stringBuilder in the appropriate format
mask(inputValues);
context.write(idNumber, new Text(stringBuilder.toString()));
I made some assumptions because you did not have the full code of the mapper. I assumed that inputValues was a String array, given the toString() output. The first value of the array should be the number value from your input, but it is now a string. You must convert the number back to IntWritable to match what your mapper is emitting (IntWritable, Text). The Hadoop framework will sort by key, and with the key being of type IntWritable it will sort in ascending order. The code you provided uses the task ID, and from reading the API https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/TaskAttemptID.html#getTaskID() it was unclear whether this would give your values the order you want. To control the order of the output I would recommend using the first value of your string array, converted to IntWritable. I don't know if this violates your intent to mask the inputValues.
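As a quick plain-Java illustration of that suggestion (no Hadoop; the record strings are taken from the example input above): parsing the first CSV field to an int and sorting by it yields ascending ID order, which is what the framework does with an IntWritable key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class IdKeyDemo {
    // Extracts the numeric ID (first CSV field) the way the mapper would
    // before wrapping it in an IntWritable.
    static int idOf(String record) {
        return Integer.parseInt(record.split(",")[0]);
    }

    public static void main(String[] args) {
        // Records deliberately out of order, mimicking shuffle arrival order.
        List<String> records = new ArrayList<>(Arrays.asList(
                "1003,KTIzNzg=,DENNIS",
                "1000,Mj4oJyk=,JERRY",
                "1002,NSQgOi8=,MARK",
                "1001,KzwpPQ==,TIA"));

        // Ascending sort by the int key, mirroring IntWritable's default order.
        records.sort(Comparator.comparingInt(IdKeyDemo::idOf));
        records.forEach(System.out::println); // 1000 first, 1003 last
    }
}
```

Note this only works while the IDs fit in an int; for very large IDs you would parse to long and emit a LongWritable instead.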
EDIT
To follow up on your comment: you can simply multiply the partition by -1; this will cause the Hadoop framework to reverse the order.
int partition = -1*taskId.getId();
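A quick plain-Java check of the negation trick (hypothetical keys, no Hadoop): negating int keys before an ascending sort hands back the original keys in descending order, which is exactly what the framework's ascending IntWritable sort produces when the mapper emits -1 * key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class NegatedKeyDemo {
    // Sorts by the negated key ascending, then reports the original keys
    // in that order: the effect of emitting -1 * key in the mapper.
    static List<Integer> orderAfterNegation(List<Integer> keys) {
        List<Integer> negated = new ArrayList<>();
        for (int k : keys) {
            negated.add(-1 * k);
        }
        Collections.sort(negated);          // the shuffle's ascending key sort
        List<Integer> seen = new ArrayList<>();
        for (int n : negated) {
            seen.add(-n);                   // recover the original key
        }
        return seen;
    }

    public static void main(String[] args) {
        // Partitions 0, 2, 1 come out as 2, 1, 0: descending order.
        System.out.println(orderAfterNegation(Arrays.asList(0, 2, 1)));
    }
}
```

One caveat: with IntWritable this relies on the keys being negatable ints (Integer.MIN_VALUE has no positive counterpart), which is fine for small partition numbers.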