Spark: Why does Python significantly outperform Scala in my use case?
To compare the performance of Spark when using Python and Scala, I created the same job in both languages and compared the runtimes. I expected both jobs to take roughly the same amount of time, but the Python job took only 27 min, while the Scala job took 37 min (almost 40% longer!). I implemented the same job in Java as well and it took 37 minutes too. How is it possible that Python is so much faster?
Minimal verifiable example:
Python job:
# Configuration
conf = pyspark.SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
conf.set("spark.executor.instances", "4")
conf.set("spark.executor.cores", "8")
sc = pyspark.SparkContext(conf=conf)
# 960 Files from a public dataset in 2 batches
input_files = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/warc/CC-MAIN-20190817203056-20190817225056-00[0-5]*"
input_files2 = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312128.3/warc/CC-MAIN-20190817102624-20190817124624-00[0-3]*"
# Count occurrences of a certain string
logData = sc.textFile(input_files)
logData2 = sc.textFile(input_files2)
a = logData.filter(lambda value: value.startswith('WARC-Type: response')).count()
b = logData2.filter(lambda value: value.startswith('WARC-Type: response')).count()
print(a, b)
Scala job:
// Configuration
config.set("spark.executor.instances", "4")
config.set("spark.executor.cores", "8")
val sc = new SparkContext(config)
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
// 960 Files from a public dataset in 2 batches
val input_files = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/warc/CC-MAIN-20190817203056-20190817225056-00[0-5]*"
val input_files2 = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312128.3/warc/CC-MAIN-20190817102624-20190817124624-00[0-3]*"
// Count occurrences of a certain string
val logData1 = sc.textFile(input_files)
val logData2 = sc.textFile(input_files2)
val num1 = logData1.filter(line => line.startsWith("WARC-Type: response")).count()
val num2 = logData2.filter(line => line.startsWith("WARC-Type: response")).count()
println(s"Lines with a: $num1, Lines with b: $num2")
Just by looking at the code, the two jobs seem to be identical. I looked at the DAGs and they didn't provide any insights (or at least I lack the know-how to come up with an explanation based on them).
I would really appreciate any pointers.
Your basic assumption, that Scala or Java should be faster for this specific task, is simply incorrect. You can easily verify it with minimal local applications. Scala one:
import scala.io.Source
import java.time.{Duration, Instant}

object App {
  def main(args: Array[String]) {
    val Array(filename, string) = args
    val start = Instant.now()
    Source
      .fromFile(filename)
      .getLines
      .filter(line => line.startsWith(string))
      .length
    val stop = Instant.now()
    val duration = Duration.between(start, stop).toMillis
    println(s"${start},${stop},${duration}")
  }
}
Python one:
import datetime
import sys

if __name__ == "__main__":
    _, filename, string = sys.argv
    start = datetime.datetime.now()
    with open(filename) as fr:
        # Not idiomatic or the most efficient, but that's what
        # PySpark will use
        sum(1 for _ in filter(lambda line: line.startswith(string), fr))
    end = datetime.datetime.now()
    duration = round((end - start).total_seconds() * 1000)
    print(f"{start},{end},{duration}")
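As the comment above notes, the lambda/filter form mirrors what PySpark generates rather than idiomatic Python. For reference, a more idiomatic sketch of the same count (my own illustration, not part of the benchmark) could look like this:

```python
def count_matching(filename, prefix):
    """Count lines in `filename` that start with `prefix`."""
    with open(filename) as fr:
        # A generator expression streams the file line by line,
        # so the whole file is never materialized in memory.
        return sum(1 for line in fr if line.startswith(prefix))
```

Both forms do the same single O(n) pass over the file; the difference is only in constant-factor overhead of the lambda-based filter.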
Results (300 repetitions each, Python 3.7.6, Scala 2.11.12), on Posts.xml from the hermeneutics.stackexchange.com data dump, with a mix of matching and non-matching patterns:
As you can see, Python is not only systematically faster, but also more consistent (lower spread).
The take-away message is: don't believe unsubstantiated FUD. Languages can be faster or slower on specific tasks or in specific environments (for example, here Scala can be hit by JVM startup and/or GC and/or JIT), but if you see claims like "XYZ is X4 faster" or "XYZ is slow as compared to ZYX (..) approximately 10x slower", it usually means that someone wrote really bad code to test things.
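To make the point about measurement concrete, here is a minimal sketch of how such a micro-benchmark could be repeated to expose variance (the synthetic data and iteration counts are made up for illustration):

```python
import timeit

# Synthetic input: half the lines match the prefix, half do not.
lines = ["WARC-Type: response"] * 5000 + ["WARC-Type: request"] * 5000

def count_matches():
    return sum(1 for line in lines if line.startswith("WARC-Type: response"))

# Run the measurement several times; a single run hides warm-up,
# GC pauses, and caching effects.
timings = timeit.repeat(count_matches, number=10, repeat=5)
print(f"min={min(timings):.4f}s max={max(timings):.4f}s")
```

Reporting the spread across repetitions (as the boxplots above do), rather than a single number, is what makes a claim like "systematically faster" defensible.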
Edit:
To address some concerns raised in the comments: communication is done locally over a socket (all communication with the worker beyond the initial connect and auth is performed using a file descriptor returned from local_connect_and_auth, and it is nothing else than a socket-associated file). Again, as cheap as it gets when it comes to communication between processes.
Edit 2:
Since jasper-m was concerned about startup cost here, one can easily prove that Python still has a significant advantage over Scala even if the input size is significantly increased.
Here are results for 2003360 lines / 5.6G (the same input, just duplicated multiple times, 30 repetitions), which way exceeds anything you can expect in a single Spark task.
Please note the non-overlapping confidence intervals.
Edit 3:
To address another comment from Jasper-M:

"The bulk of all the processing is still happening inside a JVM in the Spark case."

That is simply incorrect in this particular case: PySpark RDDs (unlike, say, DataFrame) implement the gross of their functionality natively in Python, with the exception of input, output, and inter-node communication.

The Scala job takes longer because it has a misconfiguration and, therefore, the Python and Scala jobs were provided with unequal resources.
There are two mistakes in the code:
val sc = new SparkContext(config) // LINE #1
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
sc.hadoopConfiguration.set("spark.executor.instances", "4") // LINE #4
sc.hadoopConfiguration.set("spark.executor.cores", "8") // LINE #5
sc.hadoopConfiguration is the wrong place to set any Spark configuration. It should be set in the config instance you pass to new SparkContext(config).
[ADDED] Bearing the above in mind, I would propose to change the code of the Scala job to
config.set("spark.executor.instances", "4")
config.set("spark.executor.cores", "8")
val sc = new SparkContext(config) // LINE #1
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
and re-test it. I bet the Scala version is going to be X times faster now.