
Why are my Avro output files so small and so numerous in my Pig job?

I'm running a Pig script that does a series of joins and writes its output using AvroStorage().

Everything runs fine, and I am getting the data that I want... but it is being written to 845 Avro files (~30 KB each). This does not seem right at all, yet I cannot find any setting I might have changed that would take me from my previous output of one large Avro file to 845 small ones (other than adding another data source).

Would adding that source change anything? And how can I get back to one or two files?

Thanks!

One possibility is to change your block size. If you want to get back to fewer files, you can also try Parquet: transform your .avro files through a Pig script and store them as .parquet files; this will reduce your 845 files to far fewer.
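
For example, a minimal sketch of such a conversion script. The paths are placeholders, and it assumes the parquet-pig jar (which provides org.apache.parquet.pig.ParquetStorer) is registered and that AvroStorage is available (built in since Pig 0.14, or via piggybank):

    -- register the jar that provides ParquetStorer (path is an assumption)
    REGISTER parquet-pig-bundle.jar;

    -- load the many small Avro files and rewrite them as Parquet
    data = LOAD '/path/to/avro_output' USING AvroStorage();
    STORE data INTO '/path/to/parquet_output' USING org.apache.parquet.pig.ParquetStorer();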

But getting back to fewer files isn't strictly necessary, except for the performance advantage.

The number of files written by a MapReduce job is determined by the number of reducers that ran. You can use PARALLEL in your Pig script to control the number of reducers.

If you are sure that the final data is small enough (comparable to your block size), you can add PARALLEL 1 to your JOIN statement to make sure that the JOIN is translated to a single reducer, which writes its output to a single file.
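
A sketch of what that looks like, with hypothetical relation and field names:

    -- PARALLEL 1 forces the join to run on a single reducer,
    -- so the output lands in a single part file
    joined = JOIN orders BY user_id, users BY id PARALLEL 1;
    STORE joined INTO '/path/to/output' USING AvroStorage();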

I solved that using SET pig.maxCombinedSplitSize 134217728; (i.e. 128 MB).

Note that with SET default_parallel 10; alone, the job may still output many small files, depending on the Pig job.
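
A sketch of how the two settings interact; the explanatory comments are my reading of why this worked, not stated in the original answer:

    -- combine small input splits into map tasks of up to 128 MB (134217728 bytes),
    -- which reduces the number of map tasks and, for map-only stages, output files
    SET pig.maxCombinedSplitSize 134217728;

    -- default_parallel only controls reducer count; a map-only final stage
    -- still writes one file per (combined) split, hence the setting above
    SET default_parallel 10;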

