
Hadoop Input Files Order

I have data files arranged in folders named by date. Directory structure:

  • /data/2011/01/01
  • /data/2011/01/02

and so on. Inside each directory there are around 50 files I need to parse, and I am giving Hadoop the input path /data/**/**/** so that it can parse all the files. My questions are:

  1. How can I ask Hadoop to order the input? I need to parse the files date by date.
  2. While parsing the files for a particular date, I need to preload a data structure associated with that date, which lives in the same date directory.

Thanks, Ankush

  1. You can't order the input within a single job. In the worst case, if you have as many input files as running tasks in the cluster, they will all be processed in parallel at the same moment. A common workaround is to submit one job per date directory in sequence; see the driver sketch after this list.
  2. Perhaps you can create a custom implementation of FileInputFormat that reads the required config file and does what you need? Alternatively, the per-date structure can be loaded in the mapper's setup() method; see the second sketch below.
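Neither of these ideas is spelled out in code in the answer, so treat what follows as hedged sketches. The first shows one way to get date-by-date ordering: a driver that globs the date directories, sorts them (lexicographic order is chronological order for zero-padded yyyy/mm/dd paths), and runs one blocking job per directory. The class names, mapper, and output layout are assumptions:

```java
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DateOrderedDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Glob the yyyy/mm/dd directories and sort them; for zero-padded
        // dates, lexicographic order is also chronological order.
        FileStatus[] days = fs.globStatus(new Path("/data/*/*/*"));
        Arrays.sort(days, Comparator.comparing(s -> s.getPath().toString()));

        for (FileStatus day : days) {
            Job job = Job.getInstance(conf, "parse-" + day.getPath());
            job.setJarByClass(DateOrderedDriver.class);
            job.setMapperClass(DateAwareMapper.class); // hypothetical mapper, sketched below
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, day.getPath());
            // Hypothetical output layout mirroring the input tree.
            FileOutputFormat.setOutputPath(
                    job, new Path("/output" + day.getPath().toUri().getPath()));
            // Block until this date finishes before starting the next one.
            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
        }
    }
}
```

The second sketch preloads the per-date structure in the mapper's setup() by deriving the date directory from the input split's path and reading a side file from it. The file name "_lookup.dat" and the tab-separated key/value format are placeholders for whatever your actual structure is; the leading underscore matters, because FileInputFormat's default filter skips files whose names start with "_" or ".", so the side file won't also be fed to the mappers as input:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class DateAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> dateStructure = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // The split tells us which file (and hence which date directory)
        // this mapper instance is working on.
        Path file = ((FileSplit) context.getInputSplit()).getPath();
        Path dateDir = file.getParent(); // e.g. /data/2011/01/01

        // Hypothetical side file holding the per-date structure as
        // tab-separated key/value lines.
        Path lookup = new Path(dateDir, "_lookup.dat");
        FileSystem fs = lookup.getFileSystem(context.getConfiguration());
        if (fs.exists(lookup)) {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(lookup), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        dateStructure.put(parts[0], parts[1]);
                    }
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the preloaded structure while parsing each record.
        String enriched = dateStructure.getOrDefault(value.toString(), "UNKNOWN");
        context.write(value, new Text(enriched));
    }
}
```

The per-date structure is reloaded once per map task rather than once per date; if it is large, a custom FileInputFormat as suggested above, or Hadoop's DistributedCache, may be a better fit.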

