
Spark execution time vs number of nodes on AWS EMR

I'm new to Spark. I tried to run a simple application on Amazon EMR (the Python pi approximation found here), first with 1 worker node and then with 2 worker nodes (m4.large). The elapsed time to complete the task is approximately 25 seconds in both cases. Naively, I was expecting something like a 1.5x speedup with 2 nodes. Am I being naive? Is this normal?

This question is quite broad, so my answer will be broad as well, but you'll get the picture.

More machines don't always mean faster computations, and especially not for a Pi approximation.
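
To see why, it helps to look at what the pi example actually does: each task just draws random points and counts how many land inside the unit circle, so there is almost no data to read, shuffle, or write. The sketch below is a paraphrase of Spark's bundled examples/src/main/python/pi.py, not necessarily the exact code you ran; the partition count is illustrative.

from operator import add
from random import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()

partitions = 2                  # illustrative; the real script takes this as an argument
n = 100000 * partitions

def f(_):
    # One Monte Carlo sample: does a random point fall inside the unit circle?
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions) \
             .map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()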

You shouldn't forget about potential bottlenecks: network I/O, data skew, expensive transformations, partitioning, and so on.

That's why benchmarking and monitoring should be done. Also, you might be counting the time the Spark context needs to set up and tear down, which can be a big part of your total runtime.

Plus, an m4.large is quite a powerful machine for this purpose. If you set up Ganglia on your EMR cluster, you'll notice that Spark is barely using its resources, which should lead you to think about tuning when launching a Spark application on EMR.
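
If you do want to tune it, a minimal sketch of what that can look like from the PySpark side follows. The configuration keys are standard Spark settings; the concrete values are only placeholders for m4.large workers (2 vCPUs, 8 GiB RAM), not tuned recommendations.

from pyspark.sql import SparkSession

# Minimal sketch: size executors explicitly instead of relying on defaults.
# Values are placeholders for m4.large workers, not recommendations.
spark = (
    SparkSession.builder
    .appName("pi-tuning-sketch")
    .config("spark.executor.instances", "2")    # one executor per worker node
    .config("spark.executor.cores", "2")        # use both vCPUs
    .config("spark.executor.memory", "4g")      # leave headroom for YARN overhead
    .config("spark.default.parallelism", "4")   # roughly total executor cores
    .getOrCreate()
)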

Now to answer your question: yes, that behavior is normal for the application you are launching.

Here is a post I wrote a while ago about improving latency on a single-node Apache Spark cluster that might give you more information about this topic.

Let's make a simple experiment:

from functools import reduce
from operator import add
from random import random

# Taken from the linked example.

n = 100000

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

# IPython magic: time one partition's worth of work, without Spark.
%timeit -n 100 reduce(add, (f(x) for x in range(n)))

The result I get using quite old hardware:

100 loops, best of 3: 132 ms per loop

This should be the expected processing time for a single partition, and the value we get is comparable to typical task scheduling latency.

Conclusion? What you measure is cluster and application latency (context initialization, scheduling delays, context teardown), not processing time.
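
If you want to see that breakdown for yourself, a rough sketch is to time context startup and the actual computation separately (submit it with spark-submit; the numbers will of course depend on your cluster):

import time
from operator import add
from random import random
from pyspark.sql import SparkSession

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

t0 = time.time()
spark = SparkSession.builder.appName("pi-latency-breakdown").getOrCreate()
t1 = time.time()                      # context initialization done

n = 100000 * 2
count = spark.sparkContext.parallelize(range(n), 2).map(f).reduce(add)
t2 = time.time()                      # actual computation done

print("context startup: %.1f s, computation: %.1f s, pi ~ %f"
      % (t1 - t0, t2 - t1, 4.0 * count / n))
spark.stop()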
