Let's assume you have written an algorithm in Spark and you can evaluate its performance using 1..X cores on data sets of size N, running in local mode. How would you approach questions like these:

- How long will the algorithm take on a cluster with Y nodes and a data size M >> N?
- How would it handle a data size M >> N using an arbitrary number of nodes?

Clearly, this is influenced by countless factors, and giving a precise estimate is almost impossible. But how would you come up with an educated guess? Running in local mode mainly allows you to measure CPU usage. Is there a rule of thumb to account for disk and network load in shuffles as well? Are there even ways to simulate performance on a cluster?
The data load can be estimated as O(n).
The algorithm's cost can be estimated stage by stage; the whole job is the accumulation of all its stages. Note that each stage processes a different amount of data, which can be expressed in relation to the initial input.
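A minimal sketch of that per-stage accumulation in Python; the stage model, names, and numbers below are illustrative assumptions, not a Spark API or measured values:

```python
def stage_time(records, secs_per_record, cores):
    # A linear stage: work proportional to record count, split across cores.
    return records * secs_per_record / cores

def total_time(stages, initial_records, cores):
    # The whole job is the sum of its stages; each stage's input size is
    # expressed as a multiple of the initial input.
    return sum(stage_time(initial_records * scale, cost, cores)
               for scale, cost in stages)

# Example: stage 1 touches every record, stage 2 sees 10% after a filter.
stages = [(1.0, 2e-6), (0.1, 5e-6)]
print(total_time(stages, initial_records=1_000_000_000, cores=64))
```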
Assume a baseline run on data of size N takes time T (on a single core, so the formulas below can divide by the core count), and the cluster provides X cores in total.

- If the algorithm is O(n), the estimated time is T * (M / N) / X.
- If the algorithm is O(n log n), the estimated time is T * (M / N) / X * log(M / N).
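As a rough calculator, here is a sketch of these two estimates; the concrete numbers in the example (60 s on 10 GB, scaled to 1 TB on 64 cores) are illustrative assumptions:

```python
import math

def estimate_linear(T, N, M, X):
    # O(n) algorithm: time grows with the data size, divided across X cores.
    return T * (M / N) / X

def estimate_linearithmic(T, N, M, X):
    # O(n log n) algorithm: the same estimate with the extra log(M/N) factor.
    return T * (M / N) / X * math.log(M / N)

print(estimate_linear(60.0, 10e9, 1e12, 64))        # 93.75 s
print(estimate_linearithmic(60.0, 10e9, 1e12, 64))  # ~431.7 s
```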
Edit:
If there is a big shuffle, it is O(n) with respect to bandwidth: the extra time added is roughly dataSize(M) / bandwidth.
If there are many big shuffles, consider improving the algorithm.
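To fold the shuffle term into the estimate, a sketch of that dataSize/bandwidth formula; the 10 Gbit/s figure is purely illustrative:

```python
def shuffle_time(data_size_bytes, bandwidth_bytes_per_sec):
    # A big shuffle moves roughly the whole data set over the network,
    # adding data_size / bandwidth on top of the compute estimate.
    return data_size_bytes / bandwidth_bytes_per_sec

# Example: shuffling 1 TB over an aggregate 10 Gbit/s (1.25e9 B/s) link.
print(shuffle_time(1e12, 1.25e9))      # 800 s per big shuffle
print(3 * shuffle_time(1e12, 1.25e9))  # three big shuffles: 2400 s extra
```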