
Why are my Avro output files so small and so numerous in my Pig job?

I'm running a Pig script that does a series of joins and writes its output using AvroStorage().

Everything runs fine, and I am getting the data that I want... but it is being written to 845 Avro files (~30 KB each). This does not seem right at all, yet I cannot find any setting I might have changed that would take me from my previous output of one large Avro file to 845 small ones (other than adding another data source).

Would adding that source change anything? And how can I get back to one or two files?

Thanks!

One possibility is to change your block size. If you want to get back to fewer files, you can also try Parquet: transform your .avro files through a Pig script and store them as .parquet files; this will reduce your 845 files to far fewer.
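
For example, a minimal sketch of such a conversion script. The paths are placeholders, and it assumes the parquet-pig jar (which provides org.apache.parquet.pig.ParquetStorer) is registered and that AvroStorage is available (built in since Pig 0.14, or via piggybank):

    -- register the jar that provides ParquetStorer (path is an assumption)
    REGISTER parquet-pig-bundle.jar;

    -- load the many small Avro files and rewrite them as Parquet
    data = LOAD '/path/to/avro_output' USING AvroStorage();
    STORE data INTO '/path/to/parquet_output' USING org.apache.parquet.pig.ParquetStorer();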

But getting back to fewer files isn't strictly necessary, except for the performance advantage.

The number of files written by a MapReduce job is determined by the number of reducers that ran. You can use PARALLEL in your Pig script to control the number of reducers.

If you are sure that the final data is small enough (comparable to your block size), you can add PARALLEL 1 to your JOIN statement to make sure that the JOIN is translated to a single reducer, which writes its output to a single file.
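
A sketch of what that looks like, with hypothetical relation and field names:

    -- PARALLEL 1 forces the join to run on a single reducer,
    -- so the output lands in a single part file
    joined = JOIN orders BY user_id, users BY id PARALLEL 1;
    STORE joined INTO '/path/to/output' USING AvroStorage();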

I solved that using SET pig.maxCombinedSplitSize 134217728; (i.e. 128 MB).

Note that with SET default_parallel 10; alone, the job may still output many small files, depending on the Pig job.
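
A sketch of how the two settings interact; the explanatory comments are my reading of why this worked, not stated in the original answer:

    -- combine small input splits into map tasks of up to 128 MB (134217728 bytes),
    -- which reduces the number of map tasks and, for map-only stages, output files
    SET pig.maxCombinedSplitSize 134217728;

    -- default_parallel only controls reducer count; a map-only final stage
    -- still writes one file per (combined) split, hence the setting above
    SET default_parallel 10;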

