Handling small files with PIG

As I understand it, Map/Reduce works better with large files (I understand this is due to the splitting logic, etc.). One known optimization is to pack small files into sequence files, with the file name as the key and the file contents as the value.

Now the issue is that I am using Pig for analytics, and we have thousands of files, each only a few KB in size. Since Pig Latin is converted into and run as MR jobs, I suspect those jobs will be inefficient because of the small files.

Is there any way to get some control over how Pig handles small files? Is there an out-of-the-box solution?

Pig has a feature for combining small files into larger chunks: http://pig.apache.org/docs/r0.11.1/perf.html#combine-files
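As a rough sketch of how that feature might be used in a script: the two properties below, `pig.splitCombination` and `pig.maxCombinedSplitSize`, are the ones described on the linked page. The 128 MB target, the input and output paths, and the schema are assumptions made up for illustration, so adjust them to your data.

```pig
-- Split combination is on by default; it is set explicitly here only for clarity.
SET pig.splitCombination true;
-- Combine small input files until a single map processes about 128 MB (value is in bytes).
-- 134217728 is an assumed target, not a recommended value.
SET pig.maxCombinedSplitSize 134217728;

-- Hypothetical input path and schema, for illustration only.
logs = LOAD '/data/small_files/*' USING PigStorage('\t') AS (id:chararray, value:int);
grouped = GROUP logs BY id;
totals = FOREACH grouped GENERATE group AS id, SUM(logs.value) AS total;
STORE totals INTO '/data/output/totals';
```

With settings like these, Pig should hand several small files to each map task instead of launching one map per file. This works with the built-in `PigStorage` loader; a custom loader may need to meet extra conditions, so check the linked page for those.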
