Renaming part files in Hadoop MapReduce

Question

I have tried to use the MultipleOutputs class, following the example on the page http://hadoop.apache.org/docs/mapreduce/r0.21.0/api/index.html?org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

Driver code

    Configuration conf = new Configuration();
    Job job = new Job(conf, "Wordcount");
    job.setJarByClass(WordCount.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
            Text.class, IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);

Reducer code

    public class WordCountReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        private MultipleOutputs<Text, IntWritable> mos;

        @Override
        public void setup(Context context) {
            mos = new MultipleOutputs<Text, IntWritable>(context);
        }

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            // context.write(key, result);
            // Write to the named output "text" instead of the default output.
            mos.write("text", key, result);
        }

        @Override
        public void cleanup(Context context)
                throws IOException, InterruptedException {
            // Close MultipleOutputs so the named-output files are flushed.
            mos.close();
        }
    }

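I found that the reducer's output was renamed to text-r-00000.

The problem is that I also get an empty part-r-00000 file. Is this the expected behavior of MultipleOutputs, or is there something wrong with my code? Please advise.

Another alternative I have tried is to iterate over my output folder with the FileSystem class and manually rename every file that begins with part.

What is the best way?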

    FileSystem hdfs = FileSystem.get(configuration);
    FileStatus[] fs = hdfs.listStatus(new Path(outputPath));
    for (FileStatus aFile : fs) {
        if (aFile.isDir()) {
            // delete all directories and sub-directories (if any) in the output directory
            hdfs.delete(aFile.getPath(), true);
        } else if (aFile.getPath().getName().contains("_")) {
            // delete all log files and the _SUCCESS file in the output directory
            hdfs.delete(aFile.getPath(), true);
        } else {
            hdfs.rename(aFile.getPath(), new Path(myCustomName));
        }
    }
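One caveat with the manual approach: if the job produces several part files, renaming each of them to one fixed name would make the targets collide. Below is a minimal sketch of deriving a distinct target name per file while keeping the task suffix. The class `PartFileRenamer`, the method `targetName`, and the rule "skip anything that starts with an underscore or does not start with part-" are assumptions made for illustration; none of them is part of Hadoop.

```java
import java.util.Optional;

// Hypothetical helper, not a Hadoop API: maps "part-r-00000" to "<prefix>-r-00000".
final class PartFileRenamer {
    private PartFileRenamer() {}

    // Returns the new file name, or empty if the file should be left alone
    // (for example _SUCCESS, or anything that is not a part file).
    static Optional<String> targetName(String fileName, String customPrefix) {
        if (fileName.startsWith("_") || !fileName.startsWith("part-")) {
            return Optional.empty();
        }
        // Keep everything after "part", i.e. the "-r-00000" suffix.
        return Optional.of(customPrefix + fileName.substring("part".length()));
    }
}
```

Inside the loop, the result could then be passed to something like `hdfs.rename(aFile.getPath(), new Path(aFile.getPath().getParent(), newName))`, so that each part file gets its own name in the same directory.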

Answer 1

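Even though you are using MultipleOutputs, the default OutputFormat (I believe it is TextOutputFormat) is still in use, so it gets initialized and creates the part-r-xxxxx files you are seeing.

They are empty because you never call context.write; all of your records go through MultipleOutputs. That does not prevent the files from being created during initialization.

To get rid of them, you need to define your OutputFormat to say that you are not expecting any output. You can do it this way: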

    job.setOutputFormatClass(NullOutputFormat.class);

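With that property set, your part files should never be initialized at all, but you still get your output from MultipleOutputs.

You could also use LazyOutputFormat, which ensures that output files are created only when/if there is some data, instead of initializing empty files. You can do it this way: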

    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
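To make the "create the file only when there is data" idea concrete, here is a plain-Java sketch of the same pattern. It is not Hadoop's LazyOutputFormat implementation; the class `LazyFileWriter` and its tab-separated line format are assumptions chosen only to illustrate lazy creation.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration only: opens the target file on the first write, never before.
final class LazyFileWriter implements AutoCloseable {
    private final Path target;
    private Writer delegate; // stays null until the first record arrives

    LazyFileWriter(Path target) {
        this.target = target;
    }

    void write(String key, int value) throws IOException {
        if (delegate == null) {
            // First record: only now is the file actually created.
            delegate = Files.newBufferedWriter(target, StandardCharsets.UTF_8);
        }
        delegate.write(key + "\t" + value + "\n");
    }

    @Override
    public void close() throws IOException {
        // If nothing was written, there is no file to close and none on disk.
        if (delegate != null) {
            delegate.close();
        }
    }
}
```

A writer that is opened and closed without any `write` call leaves no file behind, which is the behavior you want for the default part-r-xxxxx output when every record goes through MultipleOutputs.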

Note that in your Reducer you used the prototype MultipleOutputs.write(String namedOutput, K key, V value), which uses a default output path generated from your namedOutput, of the form {namedOutput}-(m|r)-{part-number}. If you want more control over the output file names, use the prototype MultipleOutputs.write(String namedOutput, K key, V value, String baseOutputPath), which lets you generate the file names at run time based on your keys/values.
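As a rough sketch of how such run-time names could be built, the plain-Java helper below picks a base output path from the key and shows the file name that results from the {base}-(m|r)-{part-number} pattern. The class `OutputNames`, both method names, and the "group words by first letter under words/" policy are assumptions for illustration; only the naming pattern comes from the description above.

```java
import java.util.Locale;

// Hypothetical helper, not a Hadoop API.
final class OutputNames {
    private OutputNames() {}

    // Assumed policy: group words by their first letter, e.g. "Hadoop" -> "words/h".
    static String baseOutputPathFor(String word) {
        if (word.isEmpty() || !Character.isLetter(word.charAt(0))) {
            return "words/other";
        }
        return "words/" + Character.toLowerCase(word.charAt(0));
    }

    // Mirrors the pattern {base}-(m|r)-{part-number}, with a 5-digit part number.
    static String partFileName(String base, boolean mapTask, int partition) {
        return String.format(Locale.ROOT, "%s-%s-%05d",
                base, mapTask ? "m" : "r", partition);
    }
}
```

In the reducer this would be used as `mos.write("text", key, result, OutputNames.baseOutputPathFor(key.toString()));`, so the word "Hadoop" handled by reduce task 3 would be expected to land in words/h-r-00003 under the job's output directory.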

Answer 2

All you need to do in the Driver class to change the base name of the output files is: job.getConfiguration().set("mapreduce.output.basename", "text"); This will result in your files being called "text-r-00000".

Answer 1: 21, accepted, 2013-01-28 04:39:07

Answer 2: 11, 2015-02-03 12:08:55
