
Predicting Spark performance/scalability on cluster?

Let's assume you have written an algorithm in Spark and you can evaluate its performance using 1 .. X cores on data sets of size N running in local mode. How would you approach questions like these:

  • What is the runtime running on a cluster with Y nodes and data size M >> N ?
  • What is the minimum possible runtime for a data set of size M >> N using an arbitrary number of nodes?

Clearly, this is influenced by countless factors, and giving a precise estimate is almost impossible. But how would you come up with an educated guess? Running in local mode mainly lets you measure CPU usage. Is there a rule of thumb to account for disk and network load in shuffles as well? Are there even ways to simulate performance on a cluster?
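One practical starting point (a minimal sketch, not taken from the question itself) is to time the same job under local[k] for increasing k and see how close the speedup is to linear before extrapolating to a cluster. The word-count body below is only a placeholder for the algorithm under test, and the input path is an assumed argument:

```scala
import org.apache.spark.sql.SparkSession

object LocalScalingBenchmark {
  // Placeholder job: replace the body of runJob with the algorithm being measured.
  def runJob(spark: SparkSession, inputPath: String): Long = {
    spark.read.textFile(inputPath).rdd
      .flatMap(_.split("\\s+"))
      .map((_, 1L))
      .reduceByKey(_ + _)
      .count() // action forces the whole pipeline to execute
  }

  def main(args: Array[String]): Unit = {
    val inputPath = args(0) // path to a local test data set of size N
    for (cores <- Seq(1, 2, 4, 8)) {
      // A fresh local[k] session per run, so each measurement uses exactly k cores.
      val spark = SparkSession.builder()
        .master(s"local[$cores]")
        .appName(s"scaling-benchmark-$cores")
        .getOrCreate()
      val t0 = System.nanoTime()
      runJob(spark, inputPath)
      val seconds = (System.nanoTime() - t0) / 1e9
      println(f"cores=$cores%-2d time=$seconds%.1f s")
      spark.stop()
    }
  }
}
```

If the measured times stop improving well before the core count does, the job is already bound by something other than CPU (disk, serialization, skew), and the CPU-based extrapolation below will be optimistic.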

The data load can be estimated as O(n).

The algorithm can be estimated stage by stage; the whole algorithm is the accumulation of all stages. Note that each stage processes a different amount of data, which is related to the size of the original input.

  • If the stages accumulate to O(n), the whole algorithm is O(n).
  • If the stages accumulate to O(n log n), the whole algorithm is O(n log n).
  • If the whole algorithm comes out to O(n²), it needs to be improved before it can scale to M >> N.

Assume

  • There is no huge shuffle, or the network is fast enough that shuffles are not the bottleneck
  • Each node has the same configuration
  • The total time spent is T for data size N on a single node
  • The number of nodes is X

Then, if the algorithm is O(n), the estimated time is T * M / N / X.

If the algorithm is O(n log n), the estimated time is T * M / N / X * log(M / N).
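As a small sketch of these two extrapolation formulas (the object and parameter names are mine, the formulas are the ones above):

```scala
object RuntimeEstimator {
  sealed trait Complexity
  case object Linear extends Complexity        // O(n)
  case object Linearithmic extends Complexity  // O(n log n)

  /** t: measured time for data size n on one node; m: target data size; x: number of nodes. */
  def estimateRuntime(t: Double, n: Double, m: Double, x: Int, c: Complexity): Double =
    c match {
      case Linear       => t * m / n / x
      case Linearithmic => t * m / n / x * math.log(m / n)
    }

  def main(args: Array[String]): Unit = {
    // Example with made-up numbers: 120 s measured for 10 GB on one node,
    // extrapolated to 1000 GB on 20 nodes with a linear algorithm.
    println(estimateRuntime(120.0, 10.0, 1000.0, 20, Linear)) // 600.0 s, i.e. about 10 minutes
  }
}
```

Remember this only extrapolates CPU-bound work; it ignores the shuffle cost discussed in the edit below.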

Edit

If there is a big shuffle, then it is O(n) with respect to bandwidth. The extra time added is dataSize(M) / bandwidth.
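For example, with hypothetical numbers: if the shuffle moves about 100 GB and the effective bandwidth is about 1.25 GB/s (roughly a 10 Gbit/s link), the added time is on the order of 100 GB / 1.25 GB/s ≈ 80 seconds.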

If there are many big shuffles, consider improving the algorithm instead.
