
Number of partitions needed in Spark

At Spark Summit 2013, one of the Yahoo presentations mentioned this formula:

partitions needed = total data size/(memory size/number of cores)

Assume a host with 64 GB of memory and 16 CPU cores.

The presentation states that processing 3 TB of data requires 46080 partitions. I am having a hard time getting to the same result. Please explain the calculation: how was the number 46080 derived?

Looking at the presentation (available here), the relevant information is:

  • 64 GB of memory per host
  • 16-core CPU
  • Compression ratio of 30:1, with 2x processing overhead

The formula should use the uncompressed data size, so you first need to expand the compressed 3 TB figure:

Data size = 3 TB * 30 * 2 = 180 TB = 184320 GB

Running it through the formula gives: 184320 GB / (64 GB / 16 cores) = 184320 GB / 4 GB per core = 46080 partitions
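
As a quick sanity check, here is a small Python sketch (not from the presentation, just the same arithmetic written out) that reproduces the number:

    # Figures from the slide: 3 TB compressed input, 30:1 compression ratio,
    # 2x overhead, 64 GB of memory and 16 cores per host.
    compressed_tb = 3
    compression_ratio = 30
    overhead = 2
    memory_per_host_gb = 64
    cores_per_host = 16

    # Uncompressed data size: 3 TB * 30 * 2 = 180 TB = 184320 GB
    data_size_gb = compressed_tb * compression_ratio * overhead * 1024

    # Memory available per core: 64 GB / 16 cores = 4 GB
    memory_per_core_gb = memory_per_host_gb / cores_per_host

    partitions = data_size_gb / memory_per_core_gb
    print(int(partitions))  # 46080

In practice you would then pass a number like this to Spark yourself, for example via rdd.repartition(...) or the minPartitions argument of sc.textFile(...).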
