Split reduced data into output and new input in Hadoop

I've been looking for days for a way to use reduced data for further mapping in Hadoop. I have objects of class A as input data and objects of class B as output data. The problem is that mapping generates not only Bs but new As as well.

Here's what I'd like to achieve:

1.1 input: a list of As
1.2 map result: for each A a list of new As and a list of Bs is generated
1.3 reduce: filtered Bs are saved as output, filtered As are added to the map jobs

2.1 input: a list of As produced by the first map/reduce
2.2 map result: for each A a list of new As and a list of Bs is generated
2.3 ...

3.1 ...

You should get the basic idea.
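The rounds described above can be sketched in plain Java, without Hadoop, to make the data flow explicit. This is only an illustration under assumptions: A is modelled as an Integer, B as a String, and the rule that decides which new As survive (a hypothetical `maxDepth` cutoff) is invented for the example. The names `IterativeRounds`, `runRound` and `runUntilExhausted` are not part of any library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class IterativeRounds {

    // Result of one simulated map/reduce round: As for the next round, Bs to keep.
    static final class RoundResult {
        final List<Integer> nextInputs = new ArrayList<>();
        final List<String> outputs = new ArrayList<>();
    }

    // One round: every A produces one B and one candidate A.
    // The "reduce" filter (hypothetical rule) keeps a new A only while it is below maxDepth.
    static RoundResult runRound(List<Integer> inputs, int maxDepth) {
        RoundResult result = new RoundResult();
        for (int a : inputs) {
            result.outputs.add("B-from-A" + a);
            int candidate = a + 1;
            if (candidate < maxDepth) {
                result.nextInputs.add(candidate);
            }
        }
        return result;
    }

    // Driver loop: feed the surviving As back in until a round produces none.
    static List<String> runUntilExhausted(List<Integer> seed, int maxDepth) {
        List<String> allOutputs = new ArrayList<>();
        List<Integer> current = seed;
        while (!current.isEmpty()) {
            RoundResult round = runRound(current, maxDepth);
            allOutputs.addAll(round.outputs);
            current = round.nextInputs;
        }
        return allOutputs;
    }

    public static void main(String[] args) {
        for (String b : runUntilExhausted(Collections.singletonList(0), 3)) {
            System.out.println(b);
        }
    }
}
```

In a real Hadoop setup each call to `runRound` would presumably correspond to one submitted job, with the loop living in the driver program.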

I've read a lot about chaining, but I'm not sure how to combine ChainReducer and ChainMapper, or even whether this would be the right approach.

So here's my question: how can I split the mapped data while reducing, so that one part is saved as output and the other part becomes new input data?

Try using MultipleOutputs. As its Javadoc suggests:

The MultipleOutputs class simplifies writing output data to multiple outputs.

Case one: writing to additional outputs other than the job default output. Each additional output, or named output, may be configured with its own OutputFormat, with its own key class and with its own value class.

Case two: writing data to different files provided by the user.

Usage pattern for job submission:

 Job job = Job.getInstance();

 FileInputFormat.setInputPaths(job, inDir);
 FileOutputFormat.setOutputPath(job, outDir);

 job.setMapperClass(MOMap.class);
 job.setReducerClass(MOReduce.class);
 ...

 // Defines an additional text-based named output 'text' for the job
 MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
   LongWritable.class, Text.class);

 // Defines an additional sequence-file-based named output 'seq' for the job
 MultipleOutputs.addNamedOutput(job, "seq",
   SequenceFileOutputFormat.class,
   LongWritable.class, Text.class);
 ...

 job.waitForCompletion(true);
 ...

Usage in Reducer:

 public class MOReduce extends
   Reducer<WritableComparable, Writable, WritableComparable, Writable> {

   private MultipleOutputs<WritableComparable, Writable> mos;

   // Builds a base file name from a key/value pair
   private String generateFileName(WritableComparable k, Writable v) {
     return k.toString() + "_" + v.toString();
   }

   @Override
   public void setup(Context context) {
     ...
     mos = new MultipleOutputs<>(context);
   }

   @Override
   public void reduce(WritableComparable key, Iterable<Writable> values,
       Context context) throws IOException, InterruptedException {
     ...
     // Write to the named outputs declared at job submission
     mos.write("text", key, new Text("Hello"));
     mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a");
     mos.write("seq", new LongWritable(2), new Text("Chau"), "seq_b");
     // Write to a file whose base name is derived from the key and value
     mos.write(key, new Text("value"), generateFileName(key, new Text("value")));
     ...
   }

   @Override
   public void cleanup(Context context)
       throws IOException, InterruptedException {
     mos.close();
     ...
   }
 }
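To connect this back to the question: MultipleOutputs names its reducer files after the named output (for example `text-r-00000`), so the new As could be written to a named output of their own and only those files passed to the next job. The sketch below illustrates just the file selection, using `java.nio` on a local directory rather than HDFS. The named output `next` and the class `NextInputSelector` are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class NextInputSelector {

    // Lists the files in outDir written by reducers for one named output,
    // relying on the "<namedOutput>-r-<partition>" naming scheme.
    static List<String> filesForNamedOutput(Path outDir, String namedOutput)
            throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(outDir, namedOutput + "-r-*")) {
            for (Path file : stream) {
                names.add(file.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Simulate the output directory of one round (local temp dir, not HDFS).
        Path outDir = Files.createTempDirectory("round1");
        String[] produced = {"part-r-00000", "next-r-00000", "next-r-00001", "_SUCCESS"};
        for (String name : produced) {
            Files.createFile(outDir.resolve(name));
        }
        // Only the files of the hypothetical "next" output feed the following round.
        System.out.println(filesForNamedOutput(outDir, "next"));
    }
}
```

On a cluster the same selection would more likely be expressed as a glob in the next job's input path, since FileInputFormat accepts glob patterns; the default `part-r-*` files would then remain as the Bs for that round.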
