
Multiple directories as Input format in hadoop map reduce

I am trying to run a graph verifier app on a distributed system using Hadoop. I have the input in the following format:

Directory1

---file1.dot

---file2.dot

…..

---filen.dot

Directory2

---file1.dot

---file2.dot

…..

---filen.dot

Directory670

---file1.dot

---file2.dot

…..

---filen.dot

The .dot files store the graphs.

Is it enough for me to add the input directory paths using FileInputFormat.addInputPath()?
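A minimal driver sketch of what I mean (the class name, directory paths and output path below are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class GraphVerifierDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "graph-verifier");
            job.setJarByClass(GraphVerifierDriver.class);

            // One call per input directory; addInputPath() may be called repeatedly,
            // and a glob such as /input/Directory* would also match all of them.
            FileInputFormat.addInputPath(job, new Path("/input/Directory1"));
            FileInputFormat.addInputPath(job, new Path("/input/Directory2"));
            // ... up to Directory670

            FileOutputFormat.setOutputPath(job, new Path("/output/verified"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }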

I want Hadoop to process the contents of each directory on the same node, because the files in each directory contain data that depends on the presence of other files in the same directory.

Will the Hadoop framework take care of distributing the directories equally to the various nodes of the cluster (e.g. directory 1 to node1, directory 2 to node2, and so on) and process them in parallel?

The files in each directory depend on each other for data (to be precise:

  • each directory contains a file (main.dot) holding an acyclic graph whose vertices are the names of the rest of the files,

  • so my verifier will traverse each vertex of the graph in main.dot, search for a file of the same name in the same directory and, if found, process the data in that file,

  • similarly, all the files will be processed and the combined output after processing each file in the directory is displayed,

  • the same procedure goes for the rest of the directories.)

Cutting a long story short: as in the famous word count application (if the input is a single book), Hadoop will split the input and distribute the task to each node in the cluster, where the mapper processes each line and counts the relevant words. How can I split the task here (and do I need to split it at all)?

How can I leverage Hadoop's power for this scenario? Some sample code template will help for sure :)

The solution given by Alexey Shestakov will work, but it does not leverage MapReduce's distributed processing framework. Probably only one map process will read the file (the file containing the paths of all input files) and then process the input data.

How can we allocate all the files in a directory to a mapper, so that the number of mappers equals the number of directories? One solution could be to use the "org.apache.hadoop.mapred.lib.MultipleInputs" class: call MultipleInputs.addInputPath() to add the directories and a map class for each directory path. Now each mapper can get one directory and process all files within it.
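A rough sketch of that setup, using the new-API class org.apache.hadoop.mapreduce.lib.input.MultipleInputs (the answer above names the old-API org.apache.hadoop.mapred.lib.MultipleInputs); the directory paths and the DirectoryMapper placeholder are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MultiDirDriver {

        // Placeholder mapper; the real one would read main.dot and its sibling files.
        public static class DirectoryMapper extends Mapper<LongWritable, Text, Text, Text> {
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "multi-dir-verifier");
            job.setJarByClass(MultiDirDriver.class);
            job.setNumReduceTasks(0);              // map-only job
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // Register every directory and the mapper class that should handle it.
            for (int i = 1; i <= 670; i++) {
                MultipleInputs.addInputPath(job, new Path("/input/Directory" + i),
                        TextInputFormat.class, DirectoryMapper.class);
            }

            FileOutputFormat.setOutputPath(job, new Path("/output/verified"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }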

You can create a file with a list of all the directories to process:

/path/to/directory1
/path/to/directory2
/path/to/directory3

Each mapper would process one directory, for example:

    // Each input value is one line of the list file, i.e. one directory path.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        // List every file in that directory and process them together.
        for (FileStatus status : fs.listStatus(new Path(value.toString()))) {
            // process file, e.g. read it via fs.open(status.getPath())
        }
    }
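For completeness, a possible driver for this approach, sketched under two assumptions the answer does not spell out: the mapper above lives in a class named DirectoryMapper (the name is illustrative), and NLineInputFormat is used with one line per split so that each mapper receives exactly one directory path from the list file rather than the whole list:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DirectoryListDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "per-directory-verifier");
            job.setJarByClass(DirectoryListDriver.class);
            job.setMapperClass(DirectoryMapper.class);  // the mapper shown above (assumed name)
            job.setNumReduceTasks(0);                   // map-only job

            // One line of the list file per split => one mapper per directory.
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 1);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // the file listing the directories
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }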

Will the hadoop framework take care of distributing the directories equally to various nodes of the cluster (e.g. directory 1 to node1, directory 2 to node2, and so on) and process in parallel?

No, it won't. Files are not distributed to each node in the sense that the files are copied to the node to be processed. Instead, to put it simply, each node is given a set of file paths to process with no guarantee on location or data locality. The datanode then pulls that file from HDFS and processes it.

There's no reason why you can't just open other files you may need directly from HDFS.
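For example, a map task can read a sibling file straight from HDFS when it needs it; a minimal sketch (the directory and vertex-name parameters are illustrative):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSideRead {

        // Reads another file (e.g. a vertex's .dot file in the same directory)
        // directly from HDFS.
        static String readSiblingFile(Configuration conf, Path dir, String vertexName)
                throws IOException {
            FileSystem fs = FileSystem.get(conf);
            Path sibling = new Path(dir, vertexName + ".dot");
            StringBuilder content = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(sibling), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    content.append(line).append('\n');
                }
            }
            return content.toString();
        }
    }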
