
Why is Apache-Spark - Python so slow locally as compared to pandas?

A Spark newbie here. I recently started playing around with Spark on my local machine on two cores by using the command:

pyspark --master local[2]

I have a 393 MB text file which has almost a million rows. I wanted to perform some data manipulation operations. I am using the built-in dataframe functions of PySpark to perform simple operations like `groupBy`, `sum`, `max`, and `stddev`.

However, when I do the exact same operations in pandas on the exact same dataset, pandas seems to defeat pyspark by a huge margin in terms of latency.
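For reference, the "exact same operations" on the pandas side would look roughly like this (again with hypothetical `key`/`value` columns and sample data; note that pandas' `std` is also the sample standard deviation, matching Spark's `stddev`):

```python
import pandas as pd

# Same hypothetical data; in practice this would come from pd.read_csv(...).
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": [1.0, 3.0, 2.0, 6.0]})

# Equivalent groupBy + sum / max / stddev; runs entirely in-process,
# with no JVM, scheduler, or serialization boundary involved.
agg = df.groupby("key")["value"].agg(["sum", "max", "std"])
```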

I was wondering what could be a possible reason for this. I have a couple of thoughts.

  1. Do the built-in functions perform serialization/deserialization inefficiently? If yes, what are the alternatives to them?
  2. Is the dataset too small to outrun the overhead cost of the underlying JVM on which Spark runs?

Thanks for looking. Much appreciated.

Because:

You can go on like this for a long time...


