
How to make Hadoop MR read only files instead of folders in the input path

As per our requirement, the output of one job is the input of another job.

Using MultipleOutputs, we create a new folder inside the output path and write those records into that folder. The result looks like this:

OPFolder1/MultipleOP/SplRecords-m-0000*
OPFolder1/part-m-0000* files

When the new job uses OPFolder1 as its input, I get the error below:

org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:298)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:85)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is not a file: /user/abhime01/OPFolder1/MultiplOP/

Is there any way, or any property, to make Hadoop read only the files and skip the folders?

Set mapreduce.input.fileinputformat.input.dir.recursive to true. See "FileInputFormat doesn't read files recursively in the input path dir".

One way to achieve this is to create a custom input format by subclassing the default input format class (a FileInputFormat subclass such as TextInputFormat), which lets you override the listStatus method. When implementing listStatus, you just need to skip any directories found inside your input directory.

Example:

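    // Keep only regular files; skip any sub-directories of the input path.
    for (int i = 0; i < len; ++i) {
        FileStatus file = files[i];
        if (!file.isDirectory()) {  // isDir() is the older, deprecated name
            newFiles.add(file);
        }
    }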

Hope that helps.
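The fragment in this answer leaves out the declarations of files, len and newFiles. As a rough, runnable sketch of the same filtering step, the code below uses a hypothetical stand-in class, Entry, in place of Hadoop's FileStatus so that it compiles without Hadoop on the classpath. The class names, the method name keepFilesOnly and the sample paths are all assumptions made for illustration. In a real job the loop would presumably live inside an override of listStatus in a FileInputFormat subclass, operating on the list returned by the superclass.

```java
import java.util.ArrayList;
import java.util.List;

final class DirectoryFilterSketch {

    // Hypothetical stand-in for org.apache.hadoop.fs.FileStatus, so the
    // sketch runs without Hadoop. Only the two fields used here are modeled.
    static final class Entry {
        final String path;
        final boolean directory;

        Entry(String path, boolean directory) {
            this.path = path;
            this.directory = directory;
        }

        boolean isDirectory() {
            return directory;
        }
    }

    // Returns only the entries that are regular files, preserving order.
    static List<Entry> keepFilesOnly(Entry[] files) {
        List<Entry> newFiles = new ArrayList<>();
        for (Entry file : files) {
            if (!file.isDirectory()) {
                newFiles.add(file);
            }
        }
        return newFiles;
    }

    public static void main(String[] args) {
        // Sample listing modeled on the layout in the question (names assumed).
        Entry[] listing = {
            new Entry("OPFolder1/MultipleOP", true),
            new Entry("OPFolder1/part-m-00000", false),
            new Entry("OPFolder1/part-m-00001", false),
        };
        for (Entry e : keepFilesOnly(listing)) {
            System.out.println(e.path);
        }
    }
}
```

With the sample listing, the MultipleOP directory is dropped and only the two part-m entries remain, which is the behavior the answer describes.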

Instead of using the root directory as the input path, you can use the path OPFolder1/part-m*, which matches all files in that directory whose names start with part-m.
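To see which names such a pattern would select, here is a minimal sketch that applies the glob part-m* to a list of sample names. It uses the JDK's PathMatcher as a stand-in for Hadoop's own glob expansion, on the assumption that the two agree for a pattern this simple. The class name, the method name and the sample file names are hypothetical. In an actual driver the glob would simply be passed as the input path, for example via FileInputFormat.addInputPath.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PartFileGlobSketch {

    // Returns the names selected by the glob "part-m*", preserving order.
    // Note: this is the JDK glob matcher, used only to illustrate the pattern.
    static List<String> selectPartFiles(List<String> names) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:part-m*");
        List<String> selected = new ArrayList<>();
        for (String name : names) {
            if (matcher.matches(Paths.get(name))) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Sample directory contents modeled on the question (names assumed).
        List<String> names =
            Arrays.asList("MultipleOP", "part-m-00000", "part-m-00001", "_SUCCESS");
        System.out.println(selectPartFiles(names));
        // prints [part-m-00000, part-m-00001]
    }
}
```

The MultipleOP folder never matches the pattern, so it is not handed to the record reader at all.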
