
Python vs Scala (for Spark jobs)

I am pretty new to Spark, currently exploring it by playing with pyspark and spark-shell.

Here is the situation: I run the same Spark job with pyspark and with spark-shell.

This is from pyspark:

textfile = sc.textFile('/var/log_samples/mini_log_2')
textfile.count()

And this one from spark-shell:

val textfile = sc.textFile("file:///var/log_samples/mini_log_2")
textfile.count()

I tried both several times. The first one (Python) takes 30-35 seconds to complete, while the second one (Scala) takes about 15 seconds. What could cause this difference in performance? Is it the choice of language, or does spark-shell do something in the background that pyspark doesn't?

UPDATE

So I did some tests on larger datasets, about 550 GB (zipped) in total. I am using Spark Standalone as master.

I observed that with pyspark, tasks are shared equally among executors. With spark-shell, however, they are not: more powerful machines get more tasks, while weaker machines get fewer.

With spark-shell the job finishes in 25 minutes; with pyspark it takes around 55 minutes. How can I make Spark Standalone assign tasks under pyspark the way it does under spark-shell?


Using Python has some overhead, but its significance depends on what you're doing. Recent reports indicate the overhead isn't very large, specifically for the new DataFrame API.

Some of the overhead you encounter is a constant per-job cost, which is almost irrelevant for large jobs. You should run a sample benchmark with a larger data set and see whether the overhead is a constant addition or proportional to the data size.

Another potential bottleneck is operations that apply a Python function to each element (map, etc.). If these operations are relevant for you, you should test them too.
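
One way to tell a constant overhead from a proportional one is to time the same job on a few inputs of different sizes and fit a straight line through the results. The sketch below is only an illustration: `time_call` and `fit_overhead` are hypothetical helper names, not part of Spark, and the fit assumes the run time grows roughly linearly with input size.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds of fn() over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def fit_overhead(sizes, times):
    """Least-squares fit of times ~ fixed + per_unit * sizes.

    Returns (fixed, per_unit): the estimated constant cost per job and the
    estimated cost per unit of input size.
    """
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    per_unit = sxy / sxx
    fixed = mean_y - per_unit * mean_x
    return fixed, per_unit
```

In the pyspark shell you could collect one measurement per sample file with something like `time_call(lambda: sc.textFile(path).count())`, then pass the file sizes and the timings to `fit_overhead`. Doing the same in spark-shell and comparing the two fits shows where the gap comes from. If mostly `fixed` differs, the Python overhead is per job. If `per_unit` differs, it grows with the data.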
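
To test the per-element case, a rough sketch is to run the same aggregation twice over a DataFrame, once with a built-in column function and once with a Python UDF, and compare the timings. This assumes Spark 2.0 or later with an existing `SparkSession` passed in as `spark`. The function names `line_length` and `compare_builtin_and_python_udf` are made up for this example, and `path` is whatever text file you want to test with.

```python
import time


def line_length(line):
    """Plain Python function that Spark will call once per row."""
    return len(line)


def compare_builtin_and_python_udf(spark, path):
    """Time one aggregation with a built-in function and with a Python UDF.

    `spark` is an existing SparkSession; `path` points to a text file.
    Returns a dict mapping a label to elapsed seconds.
    """
    # Imported lazily so this file also loads on a machine without pyspark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    df = spark.read.text(path).cache()
    df.count()  # materialise the cache so both runs read from memory

    python_length = F.udf(line_length, IntegerType())
    candidates = [
        ("builtin", F.length(F.col("value"))),      # evaluated inside the JVM
        ("python_udf", python_length(F.col("value"))),  # rows go through Python workers
    ]

    timings = {}
    for label, length_col in candidates:
        start = time.perf_counter()
        df.select(F.sum(length_col)).collect()
        timings[label] = time.perf_counter() - start
    return timings
```

A large gap between the two timings suggests that per-element Python calls matter for your workload. A small gap suggests the overhead lies elsewhere.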
