At Spark Summit 2013, one of the Yahoo presentations mentioned this formula:
partitions needed = total data size / (memory size / number of cores)
Assume a host with 64 GB of memory and 16 CPU cores.
The presentation states that processing 3 TB of data requires 46080 partitions. I am having a hard time getting to the same result. Can someone explain the calculation and how the number 46080 comes about?
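For example, plugging the raw numbers straight into the formula gives me something far smaller:

```python
# Naive application of the formula with the raw 3 TB figure.
data_size_gb = 3 * 1024           # 3 TB expressed in GB
memory_gb = 64
cores = 16

partitions = data_size_gb / (memory_gb / cores)
print(partitions)                 # 768.0, nowhere near 46080
```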
Looking at the presentation (available here), the key point is that the formula should use the uncompressed data size, so you first need to work out how large the data is once uncompressed:
Data size = 3 TB * 30 * 2 = 180 TB = 184320 GB
Running this through the formula you get: 184320 GB / (64 GB / 16) = 46080 partitions
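As a quick sanity check, here is the same arithmetic as a short Python snippet (the 30x and 2x factors are the ones taken from the slides above):

```python
# Reproduce the slide's partition count using the uncompressed data size.
compressed_tb = 3
uncompressed_tb = compressed_tb * 30 * 2   # expansion factors from the presentation
data_size_gb = uncompressed_tb * 1024      # 180 TB -> 184320 GB

memory_gb = 64
cores = 16

partitions = data_size_gb / (memory_gb / cores)
print(partitions)                          # 46080.0
```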