
Hadoop Input Files

Is there a difference between having, say, n files with one line each in the input folder and having one file with n lines in the input folder when running Hadoop?

If there are n files, does the InputFormat just see them all as one continuous file?

There's a big difference. It's frequently referred to as "the small files problem", and has to do with the fact that Hadoop is designed to split giant inputs into smaller tasks, but not to collect small inputs into larger tasks.
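To make this concrete, here is a minimal sketch (assuming the newer org.apache.hadoop.mapreduce API and a hypothetical input folder) that prints how many input splits a folder produces. With n small files you should see at least n splits, each becoming its own map task, whereas one large file yields roughly one split per HDFS block:

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitCount {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration());
            // "input" is a placeholder path; point it at your own folder
            FileInputFormat.addInputPath(job, new Path("input"));
            // Ask the input format how it would carve up the input
            List<InputSplit> splits = new TextInputFormat().getSplits(job);
            // n tiny files -> n splits -> n map tasks, each paying
            // full task startup overhead for a trivial amount of data
            System.out.println("Number of splits: " + splits.size());
        }
    }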

Take a look at this blog post from Cloudera: http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/

If you can avoid creating lots of small files, do so. Concatenate when possible (see the sketch below); large splittable files are much better for Hadoop.
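For example, here is a rough sketch of merging a folder of small HDFS files into one file using the FileSystem API; the "input" and "merged/all.txt" paths are placeholders. For merging down to the local filesystem, the stock hadoop fs -getmerge command does much the same thing:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ConcatSmallFiles {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path srcDir = new Path("input");          // hypothetical source folder
            Path merged = new Path("merged/all.txt"); // hypothetical output file
            try (FSDataOutputStream out = fs.create(merged)) {
                for (FileStatus stat : fs.listStatus(srcDir)) {
                    if (!stat.isFile()) continue; // skip subdirectories
                    try (FSDataInputStream in = fs.open(stat.getPath())) {
                        // copy this file's bytes; keep the output stream open
                        IOUtils.copyBytes(in, out, 4096, false);
                    }
                }
            }
        }
    }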

I once ran Pig on the Netflix dataset. It took hours to process just a few gigabytes. I then concatenated the input files (I think it was a file per movie, or a file per user) into a single file, and had my result in minutes.
