
Spark: Why does Python significantly outperform Scala in my use case?

To compare the performance of Spark when using Python and Scala, I created the same job in both languages and compared the runtimes. I expected both jobs to take roughly the same amount of time, but the Python job took only 27 min, while the Scala job took 37 min (almost 40% longer!). I implemented the same job in Java as well and it took 37 minutes too. How is it possible that Python is so much faster?

Minimal verifiable example:

Python job:

import pyspark

# Configuration
conf = pyspark.SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
conf.set("spark.executor.instances", "4")
conf.set("spark.executor.cores", "8")
sc = pyspark.SparkContext(conf=conf)

# 960 Files from a public dataset in 2 batches
input_files = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/warc/CC-MAIN-20190817203056-20190817225056-00[0-5]*"
input_files2 = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312128.3/warc/CC-MAIN-20190817102624-20190817124624-00[0-3]*"

# Count occurrences of a certain string
logData = sc.textFile(input_files)
logData2 = sc.textFile(input_files2)
a = logData.filter(lambda value: value.startswith('WARC-Type: response')).count()
b = logData2.filter(lambda value: value.startswith('WARC-Type: response')).count()

print(a, b)

Scala job:

import org.apache.spark.{SparkConf, SparkContext}

// Configuration
val config = new SparkConf()
config.set("spark.executor.instances", "4")
config.set("spark.executor.cores", "8")
val sc = new SparkContext(config)
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

// 960 Files from a public dataset in 2 batches 
val input_files = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/warc/CC-MAIN-20190817203056-20190817225056-00[0-5]*"
val input_files2 = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312128.3/warc/CC-MAIN-20190817102624-20190817124624-00[0-3]*"

// Count occurrences of a certain string
val logData1 = sc.textFile(input_files)
val logData2 = sc.textFile(input_files2)
val num1 = logData1.filter(line => line.startsWith("WARC-Type: response")).count()
val num2 = logData2.filter(line => line.startsWith("WARC-Type: response")).count()

println(s"Lines with a: $num1, Lines with b: $num2")

Just by looking at the code, the two jobs seem to be identical. I looked at the DAGs and they didn't provide any insights (or at least I lack the know-how to come up with an explanation based on them).

I would really appreciate any pointers.

Your basic assumption, that Scala or Java should be faster for this specific task, is just incorrect. You can easily verify it with minimal local applications. Scala one:

import scala.io.Source
import java.time.{Duration, Instant}

object App {
  def main(args: Array[String]) {
    val Array(filename, string) = args

    val start = Instant.now()

    Source
      .fromFile(filename)
      .getLines
      .filter(line => line.startsWith(string))
      .length

    val stop = Instant.now()
    val duration = Duration.between(start, stop).toMillis
    println(s"${start},${stop},${duration}")
  }
}

Python one:

import datetime
import sys

if __name__ == "__main__":
    _, filename, string = sys.argv
    start = datetime.datetime.now()
    with open(filename) as fr:
        # Not idiomatic or the most efficient but that's what
        # PySpark will use
        sum(1 for _ in filter(lambda line: line.startswith(string), fr))

    end = datetime.datetime.now()
    duration = round((end - start).total_seconds() * 1000)
    print(f"{start},{end},{duration}")

Results (300 repetitions each, Python 3.7.6, Scala 2.11.12), on Posts.xml from the hermeneutics.stackexchange.com data dump, with a mix of matching and non-matching patterns:

[Boxplot of the durations of the two programs above, in milliseconds]

  • Python 273.50 (258.84, 288.16)
  • Scala 634.13 (533.81, 734.45)

As you can see, Python is not only systematically faster, but also more consistent (lower spread).

The take-away message is: don't believe unsubstantiated FUD. Languages can be faster or slower on specific tasks or in specific environments (for example, here Scala can be hit by JVM startup and/or GC and/or JIT), but if you see claims like "XYZ is X4 faster" or "XYZ is slow as compared to ZYX (..) Approximately, 10x slower", it usually means that someone wrote really bad code to test things.

Edit:

To address some concerns raised in the comments:

  • In the OP's code, data is passed mostly in one direction (JVM -> Python) and no real serialization is required (this specific path just passes the bytestring as-is and decodes it as UTF-8 on the other side). That's as cheap as it gets when it comes to "serialization".
  • What is passed back is just a single integer per partition, so in that direction the impact is negligible.
  • Communication is done over local sockets (all communication on the worker beyond the initial connect and auth is performed using the file descriptor returned from local_connect_and_auth, which is nothing other than a socket-associated file). Again, that is as cheap as it gets when it comes to communication between processes.
  • Considering the difference in raw performance shown above (much higher than what you see in your program), there is a lot of margin for the overheads listed above.
  • This case is completely different from cases where either simple or complex objects have to be passed to and from the Python interpreter in a form that is accessible to both parties as pickle-compatible dumps (the most notable examples include old-style UDFs and some parts of old-style MLlib); see the sketch after this list.
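To illustrate what such a case looks like, here is a hypothetical sketch of an old-style (row-at-a-time) Python UDF; it is my own addition, not part of the original answer, and it reuses the input_files glob from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

# input_files as defined in the question; any text source behaves the same.
df = spark.read.text(input_files)

# Old-style (row-at-a-time) Python UDF: every value is serialized in a
# pickle-compatible form, shipped to a Python worker, evaluated there, and
# the boolean result is shipped back to the JVM.
is_response = udf(lambda value: value.startswith("WARC-Type: response"), BooleanType())

df.filter(is_response(col("value"))).count()

That per-row round trip is the kind of overhead the bullet above refers to; the plain textFile path in the question does not pay it.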

Edit 2:

Since jasper-m was concerned about startup cost here, one can easily prove that Python still has a significant advantage over Scala even if the input size is significantly increased.

Here are results for 2003360 lines / 5.6G (the same input, just duplicated multiple times, 30 repetitions), which far exceeds anything you can expect in a single Spark task.

[Durations of the two programs on the larger input, in milliseconds]

  • Python 22809.57 (21466.26, 24152.87)
  • Scala 27315.28 (24367.24, 30263.31)

Please note non-overlapping confidence intervals.

Edit 3:

To address another comment from Jasper-M:

"The bulk of all the processing is still happening inside a JVM in the Spark case."

That is simply incorrect in this particular case:

  • The job in question is a map job with a single global reduce, using PySpark RDDs.
  • PySpark RDDs (unlike, let's say, DataFrame) implement the bulk of their functionality natively in Python, with the exception of input, output and inter-node communication (see the DataFrame sketch after this list).
  • Since it is a single-stage job, and the final output is small enough to be ignored, the main responsibility of the JVM (if one were to nitpick, this is implemented mostly in Java, not Scala) is to invoke the Hadoop input format and push data through the socket file to Python.
  • The read part is identical for the JVM and Python APIs, so it can be considered constant overhead. It also doesn't qualify as the bulk of the processing, even for a job as simple as this one.
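For contrast, here is a sketch (my own addition, not part of the original answer) of how the same count would look with the DataFrame API, where the filter does stay inside the JVM because the predicate is compiled and executed by Catalyst; it reuses the input_files glob from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# The startswith predicate below is evaluated entirely on the JVM side;
# no row data is shipped to Python worker processes.
a = (spark.read.text(input_files)
     .filter(col("value").startswith("WARC-Type: response"))
     .count())
print(a)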

The Scala job takes longer because it has a misconfiguration and, therefore, the Python and Scala jobs were provided with unequal resources.

There are two mistakes in the code:

val sc = new SparkContext(config) // LINE #1
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
sc.hadoopConfiguration.set("spark.executor.instances", "4") // LINE #4
sc.hadoopConfiguration.set("spark.executor.cores", "8") // LINE #5
  1. LINE #1. Once this line has been executed, the resource configuration of the Spark job is already established and fixed. From this point on, there is no way to adjust anything, neither the number of executors nor the number of cores per executor.
  2. LINES #4-5. sc.hadoopConfiguration is the wrong place to set any Spark configuration. It should be set on the config instance you pass to new SparkContext(config).

[ADDED] Bearing the above in mind, I would propose to change the code of the Scala job to

config.set("spark.executor.instances", "4")
config.set("spark.executor.cores", "8")
val sc = new SparkContext(config) // LINE #1
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

and re-test it. I bet the Scala version is going to be X times faster now.
