
Why is Apache-Spark - Python so slow locally as compared to pandas?

A Spark newbie here. I recently started playing around with Spark on my local machine on two cores by using the command:

pyspark --master local[2]

I have a 393 MB text file which has almost a million rows. I wanted to perform some data manipulation operations. I am using the built-in dataframe functions of PySpark to perform simple operations like `groupBy`, `sum`, `max`, and `stddev`.

However, when I do the exact same operations in pandas on the exact same dataset, pandas seems to defeat pyspark by a huge margin in terms of latency.
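For reference, the "exact same operations" on the pandas side would look roughly like this (again with hypothetical `key`/`value` columns and sample data; note that pandas' `std` is also the sample standard deviation, matching Spark's `stddev`):

```python
import pandas as pd

# Same hypothetical data; in practice this would come from pd.read_csv(...).
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": [1.0, 3.0, 2.0, 6.0]})

# Equivalent groupBy + sum / max / stddev; runs entirely in-process,
# with no JVM, scheduler, or serialization boundary involved.
agg = df.groupby("key")["value"].agg(["sum", "max", "std"])
```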

I was wondering what could be a possible reason for this. I have a couple of thoughts.

  1. Do the built-in functions perform serialization/deserialization inefficiently? If yes, what are the alternatives to them?
  2. Is the dataset too small to outrun the overhead cost of the underlying JVM on which Spark runs?

Thanks for looking. Much appreciated.

Because:

You can go on like this for a long time...


