
Hadoop: the Mapper didn't read files from multiple input paths

My Mapper fails to read files from multiple directories. Could anyone help? I need each mapper to read exactly one file. I've added multiple input paths and implemented a custom WholeFileInputFormat and WholeFileRecordReader. The map method doesn't need the input key, and I've made sure that each map call can read a whole file.

Command line: hadoop jar AutoProduce.jar Autoproduce /input_a /input_b /output

I specified two input paths: 1. input_a; 2. input_b.

Run method snippet:

Job job = new Job(getConf());
job.setInputFormatClass(WholeFileInputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]), new Path(args[1]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));
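
The driver above hard-codes two input arguments followed by one output argument. As a minimal sketch of how that same command-line layout could be generalized to any number of inputs, the helper below treats every argument except the last as an input path and the last as the output path. The class name JobPathArgs and its methods are hypothetical, not part of Hadoop. The comma-separated string it builds is the form accepted by the FileInputFormat.setInputPaths(Job, String) overload, assuming none of the paths contains a comma.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not a Hadoop class): splits "<input>... <output>" arguments.
final class JobPathArgs {
    final List<String> inputs;
    final String output;

    private JobPathArgs(List<String> inputs, String output) {
        this.inputs = Collections.unmodifiableList(inputs);
        this.output = output;
    }

    // Every argument except the last is an input path; the last is the output path.
    static JobPathArgs parse(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException("usage: <input>... <output>");
        }
        List<String> inputs =
            new ArrayList<>(Arrays.asList(args).subList(0, args.length - 1));
        return new JobPathArgs(inputs, args[args.length - 1]);
    }

    // Joins the input paths with commas (assumes no path contains a comma).
    String commaSeparatedInputs() {
        return String.join(",", inputs);
    }

    public static void main(String[] args) {
        JobPathArgs parsed = parse(new String[] {"/input_a", "/input_b", "/output"});
        System.out.println("inputs: " + parsed.commaSeparatedInputs());
        System.out.println("output: " + parsed.output);
    }
}
```

In a real driver, the joined string would be passed to the job's input configuration and parsed.output to FileOutputFormat.setOutputPath.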

map method snippet:

public void map(NullWritable key, BytesWritable value, Context context){
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    System.out.println("Directory :" + fileSplit.getPath().toString());
    ......
}
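
The map method above prints the path of each split it processes. To check at a glance which input directories were actually visited, those logged paths can be tallied by parent directory. The following is a rough sketch in plain Java, independent of Hadoop; the class SplitDirectoryTally, its methods, and the sample host name and file names are all made up for illustration, and it assumes the logged paths are valid URIs (no spaces or other unescaped characters).

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical diagnostic helper: counts logged split paths per parent directory.
final class SplitDirectoryTally {

    // Returns the parent directory of a path such as "hdfs://host:8020/input_a/f1".
    static String parentOf(String pathOrUri) {
        String path = URI.create(pathOrUri).getPath();
        int slash = path.lastIndexOf('/');
        return slash <= 0 ? "/" : path.substring(0, slash);
    }

    // Maps each parent directory to the number of logged paths under it.
    static Map<String, Integer> tally(List<String> loggedPaths) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String p : loggedPaths) {
            counts.merge(parentOf(p), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample paths are invented; real ones would come from the task logs.
        List<String> logged = Arrays.asList(
            "hdfs://namenode:8020/input_a/file1",
            "hdfs://namenode:8020/input_a/file2",
            "hdfs://namenode:8020/input_b/file3");
        System.out.println(tally(logged));
    }
}
```

If only one directory shows up in the tally, the splits for the other input path were never created or never scheduled, which narrows the problem to the driver configuration rather than the record reader.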

Custom WholeFileInputFormat:

class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
        InputSplit split, TaskAttemptContext context) throws IOException,
        InterruptedException {

        WholeFileRecordReader reader = new WholeFileRecordReader();
        reader.initialize(split, context);
        return reader;
    }
}

Custom WholeFileRecordReader:

class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit fileSplit;
    private Configuration conf;
    private BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
    throws IOException, InterruptedException {
        this.fileSplit = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!processed) {

            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }
        return false;
    }
    @Override
    public NullWritable getCurrentKey() throws IOException,InterruptedException {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException,InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
        // do nothing
    }
}
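
The core of nextKeyValue() is a "size the buffer from the file length, then read fully" pattern. The sketch below shows the same pattern with plain java.nio on the local file system as a stand-in for HDFS, so it is not the record reader itself. The class WholeFileReadSketch and its methods are hypothetical. It also makes explicit one limit that the (int) cast in the reader hides: a single Java byte array cannot hold more than Integer.MAX_VALUE bytes.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical local-file analogue of the whole-file read in nextKeyValue().
final class WholeFileReadSketch {

    // Converts a file length to an array length, rejecting files too large for one array.
    static int toArrayLength(long length) throws IOException {
        if (length < 0 || length > Integer.MAX_VALUE) {
            throw new IOException("file length does not fit in a byte array: " + length);
        }
        return (int) length;
    }

    // Reads the entire file into memory in one buffer.
    static byte[] readWholeFile(Path file) throws IOException {
        byte[] contents = new byte[toArrayLength(Files.size(file))];
        // try-with-resources closes the stream even if readFully throws.
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            in.readFully(contents);
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("whole-file", ".bin");
        try {
            Files.write(tmp, new byte[] {1, 2, 3});
            System.out.println("bytes read: " + readWholeFile(tmp).length);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```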

PROBLEM:

After setting two input paths, all map tasks read files from only one directory.

Thanks in advance.

You'll have to use MultipleInputs instead of FileInputFormat in the driver, so your code should look like this:

MultipleInputs.addInputPath(job, new Path(args[0]), <Input_Format_Class_1>);
MultipleInputs.addInputPath(job, new Path(args[1]), <Input_Format_Class_2>);
.
.
.
MultipleInputs.addInputPath(job, new Path(args[N-1]), <Input_Format_Class_N>);

So if you want to use WholeFileInputFormat for the first input path and TextInputFormat for the second, you would write:

MultipleInputs.addInputPath(job, new Path(args[0]), WholeFileInputFormat.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class);

Hope this works for you!

