
How many MapReduce jobs will run if I query a partitioned table in Hive

This may seem a little silly, but I just want to know the exact answer. Suppose I have a table with 2 partitions. If I run a query filtered on the partition column, how many map tasks will run in the background?

Any help would be greatly appreciated!

Thanks in advance

I've read that the number of mappers is determined by the formula: input size divided by block size. The default block size in Hadoop 2 is 128 MB.

Therefore I assume you could divide the total size of the files in that partition by 128 MB.
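That back-of-the-envelope estimate can be written out as a quick calculation. This is only a sketch of the rule of thumb above (one mapper per 128 MB block of input in the queried partition); the partition size used here is a made-up example, and real split computation also depends on input format and split-size settings.

```python
import math

# Hadoop 2 default HDFS block size: 128 MB.
BLOCK_SIZE = 128 * 1024 * 1024

def estimate_mappers(partition_bytes: int) -> int:
    """Ceiling of input size over block size, with a minimum of one mapper."""
    return max(1, math.ceil(partition_bytes / BLOCK_SIZE))

# Hypothetical partition of 300 MB: spans three 128 MB blocks.
print(estimate_mappers(300 * 1024 * 1024))  # -> 3
```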

So this depends on two things:

  1. By default, with non-splittable files, Hadoop runs one map task per input file. So if your partition folder contains 100 input files, it will run 100 mappers. This is the case, for example, for gzip-compressed text files.

  2. If your files are splittable, Hadoop splits them based on your block size settings. This requires a splittable file format, such as sequence files or plain (uncompressed) tab-delimited text files.

It's easiest to reason about if you just use simple flat files. Hope that helps.
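The two cases above can be modeled with a small sketch. This is an assumption-laden simplification (file sizes and the 128 MB block size are illustrative; Hive's actual input formats can also combine small files into fewer splits), not how Hadoop computes splits internally.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed Hadoop 2 default, in bytes

def estimate_mappers_for_files(file_sizes: list[int], splittable: bool) -> int:
    """Model the two cases for the files in one partition folder."""
    if not splittable:
        # Case 1 - non-splittable (e.g. gzip): one mapper per input file.
        return len(file_sizes)
    # Case 2 - splittable (e.g. sequence files): one mapper per block
    # of each file, with at least one mapper per file.
    return sum(max(1, math.ceil(size / BLOCK_SIZE)) for size in file_sizes)

# Hypothetical partition holding a single 500 MB file:
one_big_file = [500 * 1024 * 1024]
print(estimate_mappers_for_files(one_big_file, splittable=False))  # -> 1
print(estimate_mappers_for_files(one_big_file, splittable=True))   # -> 4

# Hypothetical partition holding 100 small files:
many_small_files = [10 * 1024 * 1024] * 100
print(estimate_mappers_for_files(many_small_files, splittable=False))  # -> 100
```

Note how the splittable case turns one 500 MB file into four map tasks (ceil(500/128) = 4), while the non-splittable case is driven purely by the file count.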
