
Predicting Spark performance/scalability on cluster?

Let's assume you have written an algorithm in Spark and you can evaluate its performance using 1 .. X cores on data sets of size N running in local mode. How would you approach questions like these:

  • What is the runtime running on a cluster with Y nodes and data size M >> N ?
  • What is the minimum possible runtime for a data set of size M >> N using an arbitrary number of nodes?

Clearly, this is influenced by countless factors, and giving a precise estimate is almost impossible. But how would you come up with an educated guess? Running in local mode mainly lets you measure CPU usage. Is there a rule of thumb to account for disk and network load in shuffles as well? Are there even ways to simulate performance on a cluster?
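One practical starting point (a minimal sketch, not taken from the question itself) is to time the same job under local[k] for increasing k and see how close the speedup is to linear before extrapolating to a cluster. The word-count body below is only a placeholder for the algorithm under test, and the input path is an assumed argument:

```scala
import org.apache.spark.sql.SparkSession

object LocalScalingBenchmark {
  // Placeholder job: replace the body of runJob with the algorithm being measured.
  def runJob(spark: SparkSession, inputPath: String): Long = {
    spark.read.textFile(inputPath).rdd
      .flatMap(_.split("\\s+"))
      .map((_, 1L))
      .reduceByKey(_ + _)
      .count() // action forces the whole pipeline to execute
  }

  def main(args: Array[String]): Unit = {
    val inputPath = args(0) // path to a local test data set of size N
    for (cores <- Seq(1, 2, 4, 8)) {
      // A fresh local[k] session per run, so each measurement uses exactly k cores.
      val spark = SparkSession.builder()
        .master(s"local[$cores]")
        .appName(s"scaling-benchmark-$cores")
        .getOrCreate()
      val t0 = System.nanoTime()
      runJob(spark, inputPath)
      val seconds = (System.nanoTime() - t0) / 1e9
      println(f"cores=$cores%-2d time=$seconds%.1f s")
      spark.stop()
    }
  }
}
```

If the measured times stop improving well before the core count does, the job is already bound by something other than CPU (disk, serialization, skew), and the CPU-based extrapolation below will be optimistic.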

The data load can be estimated as O(n).

The algorithm can be estimated stage by stage; the whole algorithm is the accumulation of all stages. Note that each stage processes a different amount of data, which is related to the size of the original input.

  • If the stages accumulate to O(n), the whole algorithm is O(n).
  • If the stages accumulate to O(n log n), the whole algorithm is O(n log n).
  • If the whole algorithm comes out to O(n²), it needs to be improved before it can scale to M >> N.

Assume

  • There is no huge shuffle, or the network is fast enough that shuffles are not the bottleneck
  • Each node has the same configuration
  • The total time spent is T for data size N on a single node
  • The number of nodes is X

Then, if the algorithm is O(n), the estimated time is T * M / N / X.

If the algorithm is O(n log n), the estimated time is T * M / N / X * log(M / N).
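As a small sketch of these two extrapolation formulas (the object and parameter names are mine, the formulas are the ones above):

```scala
object RuntimeEstimator {
  sealed trait Complexity
  case object Linear extends Complexity        // O(n)
  case object Linearithmic extends Complexity  // O(n log n)

  /** t: measured time for data size n on one node; m: target data size; x: number of nodes. */
  def estimateRuntime(t: Double, n: Double, m: Double, x: Int, c: Complexity): Double =
    c match {
      case Linear       => t * m / n / x
      case Linearithmic => t * m / n / x * math.log(m / n)
    }

  def main(args: Array[String]): Unit = {
    // Example with made-up numbers: 120 s measured for 10 GB on one node,
    // extrapolated to 1000 GB on 20 nodes with a linear algorithm.
    println(estimateRuntime(120.0, 10.0, 1000.0, 20, Linear)) // 600.0 s, i.e. about 10 minutes
  }
}
```

Remember this only extrapolates CPU-bound work; it ignores the shuffle cost discussed in the edit below.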

Edit

If there is a big shuffle, then it is O(n) with respect to bandwidth. The extra time added is dataSize(M) / bandwidth.
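For example, with hypothetical numbers: if the shuffle moves about 100 GB and the effective bandwidth is about 1.25 GB/s (roughly a 10 Gbit/s link), the added time is on the order of 100 GB / 1.25 GB/s ≈ 80 seconds.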

If there are many big shuffles, consider improving the algorithm instead.
