
Get Total Input Path Count in Hadoop Mapper

We are trying to grab the total number of input paths our MapReduce program is iterating through in our mapper. We are going to use this along with a counter to format our value depending on the index. Is there an easy way to pull the total input path count from the mapper? Thanks in advance.

You could look through the source for FileInputFormat.getSplits() - this pulls back the mapred.input.dir configuration property and then resolves this CSV to an array of Paths.

These paths can still represent folders and regexes, so the next thing getSplits() does is pass the array to a protected method, org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(JobContext). This actually goes through the dirs / regexes listed and lists the matching files (also invoking a PathFilter if configured).

So, with this method being protected, you could create a simple 'dummy' extension of FileInputFormat that exposes a public listStatus method, accepting the Mapper.Context as its argument (a Mapper.Context is a JobContext), and in turn wraps a call to the FileInputFormat.listStatus method:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class DummyFileInputFormat extends FileInputFormat<LongWritable, Text> {
    // widen the visibility of the protected listStatus method;
    // a Mapper.Context can be passed in here, since it is a JobContext
    @Override
    public List<FileStatus> listStatus(JobContext context) throws IOException {
        return super.listStatus(context);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException,
            InterruptedException {
        // dummy input format, so this will never be called
        return null;
    }
}
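A minimal sketch of calling it from a mapper's setup method (the mapper class name here is made up for illustration; this assumes the new org.apache.hadoop.mapreduce API):

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int totalInputPaths;

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Mapper.Context is a JobContext, so it can be handed to listStatus
        List<FileStatus> files = new DummyFileInputFormat().listStatus(context);
        totalInputPaths = files.size();
    }
}
```

Note this re-lists the input directories in every map task, so for a very large number of input files it may be cheaper to compute the count once in the driver instead.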

EDIT: In fact it looks like FileInputFormat already does this for you, configuring a job property mapreduce.input.num.files at the end of the getSplits() method (at least in 1.0.2, probably introduced in 0.20.203).
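If that property is set in your Hadoop version, a mapper can read it back directly in setup (a sketch; the -1 default is just a sentinel meaning "not set"):

```java
@Override
protected void setup(Context context)
        throws IOException, InterruptedException {
    int numFiles = context.getConfiguration()
            .getInt("mapreduce.input.num.files", -1);
    if (numFiles < 0) {
        // property absent in this Hadoop version - fall back to listing
    }
}
```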

Here's the JIRA ticket

You can set a configuration property in your job with the number of input paths, like so:

jobConf.setInt("numberOfPaths", paths.length);

Just put that code in the place where you configure your job. After that, read it back out of the configuration in your Mapper.setup(Mapper.Context context) by getting it from the context.
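Putting both halves together (a sketch; `paths` is assumed to be the array of input Paths in your driver, the property name "numberOfPaths" is arbitrary, and the mapper class name is made up):

```java
// Driver side, when configuring the job:
jobConf.setInt("numberOfPaths", paths.length);

// Mapper side:
public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int numberOfPaths;

    @Override
    protected void setup(Context context) {
        // 0 is a fallback in case the driver never set the property
        numberOfPaths = context.getConfiguration().getInt("numberOfPaths", 0);
    }
}
```

This avoids re-listing the input directories in every task, at the cost of having to remember to set the property in the driver.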
