Splitting reduced data into output and new input in Hadoop

Question

I've been looking for days for a way to use reduced data for further mapping in Hadoop. My input data consists of objects of class A, and my output data of objects of class B. The problem is that the map step generates not only Bs but new As as well.

Here's what I'd like to achieve:

1.1 input: a list of As
1.2 map result: for each A a list of new As and a list of Bs is generated
1.3 reduce: filtered Bs are saved as output, filtered As are added to the map jobs

2.1 input: a list of As produced by the first map/reduce
2.2 map result: for each A a list of new As and a list of Bs is generated
2.3 ...

3.1 ...

You should get the basic idea.

I've read a lot about chaining, but I'm not sure how to combine ChainReducer and ChainMapper, or even whether this would be the right approach.

So here's my question: how can I split the mapped data while reducing, so that one part is saved as output and the other part becomes new input data?

Answer 1

Try using MultipleOutputs. As its Javadoc says:

The MultipleOutputs class simplifies writing output data to multiple outputs.

Case one: writing to additional outputs other than the job default output. Each additional output, or named output, may be configured with its own OutputFormat, with its own key class and with its own value class.

Case two: writing data to different files provided by the user.

Usage pattern for job submission:

Job job = new Job();

 FileInputFormat.setInputPaths(job, inDir);
 FileOutputFormat.setOutputPath(job, outDir);

 job.setMapperClass(MOMap.class);
 job.setReducerClass(MOReduce.class);
 ...

 // Defines additional single text based output 'text' for the job
 MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
 LongWritable.class, Text.class);

 // Defines additional sequence-file based output 'seq' for the job
 MultipleOutputs.addNamedOutput(job, "seq",
   SequenceFileOutputFormat.class,
   LongWritable.class, Text.class);
 ...

 job.waitForCompletion(true);
 ...

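Usage in Reducer:

 public class MOReduce extends
   Reducer<WritableComparable, Writable, WritableComparable, Writable> {

   private MultipleOutputs<WritableComparable, Writable> mos;

   // Builds a base file name from a key/value pair
   private String generateFileName(WritableComparable k, Writable v) {
     return k.toString() + "_" + v.toString();
   }

   @Override
   public void setup(Context context) {
     ...
     mos = new MultipleOutputs<WritableComparable, Writable>(context);
   }

   @Override
   public void reduce(WritableComparable key, Iterable<Writable> values,
       Context context)
       throws IOException, InterruptedException {
     ...
     // Write to the named output "text"
     mos.write("text", key, new Text("Hello"));
     // Write to the named output "seq" with an explicit base output path
     mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a");
     mos.write("seq", new LongWritable(2), new Text("Chau"), "seq_b");
     // Write with the job's output format to a file named after key and value
     mos.write(key, new Text("value"), generateFileName(key, new Text("value")));
     ...
   }

   @Override
   public void cleanup(Context context)
       throws IOException, InterruptedException {
     mos.close();
     ...
   }
 }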
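
To relate this back to the rounds described in the question, the control flow can be sketched in plain Java with no Hadoop dependency. This is only a toy model under stated assumptions, not Hadoop API: `IterativeSplit`, `Emitted`, `run`, and the `maxRounds` safety cap are hypothetical names introduced here for illustration. In a real setup, each loop iteration would presumably be one map/reduce job whose reducer writes the Bs to one named output and the As to another, with the driver using the As as the next job's input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical, Hadoop-free model of the rounds described in the question.
final class IterativeSplit {

    /** What mapping a single A produces: follow-up As and result Bs. */
    static final class Emitted<A, B> {
        final List<A> newAs;
        final List<B> bs;

        Emitted(List<A> newAs, List<B> bs) {
            this.newAs = newAs;
            this.bs = bs;
        }
    }

    /**
     * Runs rounds until no As are left or maxRounds is reached.
     * Filtered Bs are collected as final output; filtered As become
     * the input of the next round.
     */
    static <A, B> List<B> run(List<A> seed,
                              Function<A, Emitted<A, B>> map,
                              Predicate<A> keepA,
                              Predicate<B> keepB,
                              int maxRounds) {
        List<B> output = new ArrayList<>();
        List<A> input = new ArrayList<>(seed);
        for (int round = 0; round < maxRounds && !input.isEmpty(); round++) {
            List<A> next = new ArrayList<>();
            for (A a : input) {
                Emitted<A, B> e = map.apply(a);      // "map" step
                for (B b : e.bs) {                   // "reduce": keep Bs as output
                    if (keepB.test(b)) output.add(b);
                }
                for (A na : e.newAs) {               // "reduce": keep As for next round
                    if (keepA.test(na)) next.add(na);
                }
            }
            input = next;
        }
        return output;
    }

    public static void main(String[] args) {
        // Toy data: an A is an integer n; it yields the B "b<n>" and the new A n - 1.
        List<String> out = IterativeSplit.<Integer, String>run(
            Arrays.asList(3),
            n -> new Emitted<>(Arrays.asList(n - 1), Arrays.asList("b" + n)),
            n -> n > 0,   // drop As that reached zero
            b -> true,    // keep every B
            10);
        System.out.println(out); // prints [b3, b2, b1]
    }
}
```

The loop stops on its own once a round produces no surviving As; the `maxRounds` cap is only a guard against inputs that never run dry.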

Answer 1 (score 2, accepted), 2013-01-13 18:48:47
