
Hadoop input path: specify folder range

How do I specify a generic input path for my MapReduce job?

An example folder structure is:

   folderA/folderB/folderC/mainfolder/date/day/data files

There are many date folders and many day folders.

I want to drill down into a specific range of date folders and then pick up a specific range of data files. If I try

'folderA/folderB/folderC/mainfolder/*/*' 

this will read all files. I want to specify a date folder range, i.e. read all files in the date folders from 13-06-01 through 13-06-25 and ignore all other date folders. How do I do that?

If you are providing

'folderA/folderB/folderC/mainfolder/*/*' 

as the input and want to filter out specific paths, you might want to create a custom PathFilter.

FileInputFormat has this method:


static void setInputPathFilter (JobConf conf, Class<? extends PathFilter> filter)
Info: Set a PathFilter to be applied to the input paths for the map-reduce job

For example:

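import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Declared as a nested class, e.g. inside the job's driver class.
public static class CustomPathFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        // Implement the logic that decides whether this path is in the valid range.
        // The valid range of dates and days for directories and files can be
        // passed in as arguments to the job.
        // Return true if the path matches, false otherwise.
        return false; // placeholder: rejects every path until the logic is filled in
    }
}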

Register the PathFilter like this:

FileInputFormat.setInputPathFilter(job, CustomPathFilter.class);
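
The range check that the accept method above leaves open could be sketched roughly as follows. This is only an illustration in plain Java, with no Hadoop dependency, so it can be tried on its own. The class name DateRangeCheck and the method inRange are made up for this example, and it assumes that the date folder names follow a yy-MM-dd pattern (as 13-06-01 suggests) and that every path handed to the filter contains the date folder as one of its components.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of Hadoop: decides whether a path string
// contains a date folder that lies inside an inclusive date range.
public class DateRangeCheck {
    // Assumption: folder names such as "13-06-01" mean year-month-day.
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yy-MM-dd");

    private final LocalDate start;
    private final LocalDate end;

    public DateRangeCheck(String start, String end) {
        this.start = LocalDate.parse(start, FMT);
        this.end = LocalDate.parse(end, FMT);
    }

    // Returns true if the first date-like path component is within [start, end].
    public boolean inRange(String path) {
        for (String part : path.split("/")) {
            if (!part.matches("\\d{2}-\\d{2}-\\d{2}")) {
                continue; // not a date folder, keep looking
            }
            try {
                LocalDate date = LocalDate.parse(part, FMT);
                return !date.isBefore(start) && !date.isAfter(end);
            } catch (DateTimeParseException e) {
                return false; // looks like a date but is not a valid one
            }
        }
        return false; // no date folder found in this path
    }
}
```

Inside accept, one might then return something like check.inRange(path.toUri().getPath()), where check is a DateRangeCheck built from the two range boundaries. Because the filter is registered by class rather than by instance, the boundaries would have to come from constants or from the job configuration. Note also that the same check has to pass for the day folders and the data files below a date folder, which is why this sketch looks at every path component rather than only the last one.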
