
Why are my Avro output files so small and so numerous in my Pig job?

I'm running a Pig script that does a series of joins and writes the result using AvroStorage().

Everything runs well, and I am getting the data that I want, but it is being written to 845 Avro files (~30 KB each). This does not seem right at all. Previously the output was 1 large Avro file, and I cannot find any setting I might have changed to go from that to 845 small ones (other than adding another data source).

Would adding a data source change anything? And how can I get the output back to one or two files?

Thanks!

One possibility is to change your block size. If you want to get back to fewer files, you can also try Parquet: read your .avro files in a Pig script and store the result as .parquet files, which will reduce your 845 files to fewer files.

That said, there is no need to get back to fewer files other than for a performance advantage.

The number of files written by a MapReduce job is determined by the number of reducers that run. You can use PARALLEL in a Pig script to control the number of reducers.

If you are sure that the final data is small enough (comparable to your block size), you can add PARALLEL 1 to your JOIN statement so that the JOIN runs with a single reducer and therefore writes its output to a single file.
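As a rough illustration of the PARALLEL 1 suggestion, the sketch below puts the clause on the last join before the STORE. The paths, relation names, and field names (users, orders, user_id, name, total) are hypothetical and not taken from the question. It also assumes a Pig version where AvroStorage is built in (0.12 or later); with the older piggybank version you would need to REGISTER the jar and use the full class name. The FOREACH only renames the join's prefixed fields (users::user_id and so on) to plain names, since "::" is not valid in an Avro field name.

```pig
-- Hypothetical inputs; replace paths and field names with your own.
users  = LOAD 'input/users'  USING AvroStorage();
orders = LOAD 'input/orders' USING AvroStorage();

-- PARALLEL 1 requests a single reducer for this join, so the
-- reduce phase that feeds the STORE writes one part file.
joined = JOIN users BY user_id, orders BY user_id PARALLEL 1;

-- Rename the prefixed join fields to plain names before storing.
result = FOREACH joined GENERATE
    users::user_id AS user_id,
    users::name    AS name,
    orders::total  AS total;

STORE result INTO 'output/joined' USING AvroStorage();
```

In a script with a series of joins, it is the operator that feeds the final STORE that decides the file count, so that is where the clause would normally go. A single reducer also serializes the whole join, which is only reasonable when the joined data is small.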

I solved this using SET pig.maxCombinedSplitSize 134217728;

With SET default_parallel 10; it may still output many small files, depending on the Pig job.
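For reference, the two settings mentioned above would sit at the top of the script, roughly as sketched here. The value 134217728 is 128 MiB expressed in bytes. The comments describe the usual behaviour of these properties, not something confirmed for the asker's job: combining input splits reduces the number of map tasks, which matters for the output file count when the final stage is map-only.

```pig
-- Combine small input splits until each combined split is
-- about 128 MiB (134217728 bytes), giving fewer map tasks.
SET pig.maxCombinedSplitSize 134217728;

-- Default number of reducers for reduce-side operators
-- that do not carry their own PARALLEL clause.
SET default_parallel 10;
```

The two settings act on different phases, which may be why default_parallel on its own did not help here: it only affects stages that have a reduce phase.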
