
Hadoop Input Files Order

I have data files arranged in folders named by date. Directory structure:

  • /data/2011/01/01
  • /data/2011/01/02

and so on. Inside each directory there are around 50 files I need to parse, and I am giving Hadoop the input path /data/**/**/** so that it can parse all the files. My questions are:

  1. How can I ask Hadoop to order the input? I need to parse the files date by date.
  2. While parsing the files for a particular date, I need to preload a data structure associated with that date, which lives in the same date directory.

Thanks, Ankush

  1. You can't order the input within a single job. In the worst case, if you have as many input files as running tasks in the cluster, they will all be processed in parallel at the same moment. A common workaround is to submit one job per date directory in sequence; see the driver sketch after this list.
  2. Perhaps you can create a custom implementation of FileInputFormat that reads the required config file and does what you need? Alternatively, the per-date structure can be loaded in the mapper's setup() method; see the second sketch below.
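Neither of these ideas is spelled out in code in the answer, so treat what follows as hedged sketches. The first shows one way to get date-by-date ordering: a driver that globs the date directories, sorts them (lexicographic order is chronological order for zero-padded yyyy/mm/dd paths), and runs one blocking job per directory. The class names, mapper, and output layout are assumptions:

```java
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DateOrderedDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Glob the yyyy/mm/dd directories and sort them; for zero-padded
        // dates, lexicographic order is also chronological order.
        FileStatus[] days = fs.globStatus(new Path("/data/*/*/*"));
        Arrays.sort(days, Comparator.comparing(s -> s.getPath().toString()));

        for (FileStatus day : days) {
            Job job = Job.getInstance(conf, "parse-" + day.getPath());
            job.setJarByClass(DateOrderedDriver.class);
            job.setMapperClass(DateAwareMapper.class); // hypothetical mapper, sketched below
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, day.getPath());
            // Hypothetical output layout mirroring the input tree.
            FileOutputFormat.setOutputPath(
                    job, new Path("/output" + day.getPath().toUri().getPath()));
            // Block until this date finishes before starting the next one.
            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
        }
    }
}
```

The second sketch preloads the per-date structure in the mapper's setup() by deriving the date directory from the input split's path and reading a side file from it. The file name "_lookup.dat" and the tab-separated key/value format are placeholders for whatever your actual structure is; the leading underscore matters, because FileInputFormat's default filter skips files whose names start with "_" or ".", so the side file won't also be fed to the mappers as input:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class DateAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> dateStructure = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // The split tells us which file (and hence which date directory)
        // this mapper instance is working on.
        Path file = ((FileSplit) context.getInputSplit()).getPath();
        Path dateDir = file.getParent(); // e.g. /data/2011/01/01

        // Hypothetical side file holding the per-date structure as
        // tab-separated key/value lines.
        Path lookup = new Path(dateDir, "_lookup.dat");
        FileSystem fs = lookup.getFileSystem(context.getConfiguration());
        if (fs.exists(lookup)) {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(lookup), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        dateStructure.put(parts[0], parts[1]);
                    }
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the preloaded structure while parsing each record.
        String enriched = dateStructure.getOrDefault(value.toString(), "UNKNOWN");
        context.write(value, new Text(enriched));
    }
}
```

The per-date structure is reloaded once per map task rather than once per date; if it is large, a custom FileInputFormat as suggested above, or Hadoop's DistributedCache, may be a better fit.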

