Why is Apache-Spark - Python so slow locally as compared to pandas?
A Spark newbie here. I recently started playing around with Spark on my local machine on two cores by using the command:
pyspark --master local[2]
I have a 393 MB text file with almost a million rows, and I wanted to perform some data manipulation on it. I am using the built-in DataFrame functions of PySpark to perform simple operations like groupBy, sum, max, and stddev.
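For reference, here is a minimal sketch of the kind of pipeline described above. The file name and the column names ("category", "value") are my own assumptions, not from the question:

# Hypothetical reproduction of the setup: a local[2] session and a few
# built-in DataFrame aggregations. File name and column names are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Assumed: a delimited text file with a header row.
df = spark.read.csv("data.txt", header=True, inferSchema=True)

agg = df.groupBy("category").agg(
    F.sum("value").alias("total"),
    F.max("value").alias("maximum"),
    F.stddev("value").alias("std_dev"),
)
agg.show()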
However, when I perform the exact same operations in pandas on the exact same dataset, pandas seems to beat PySpark by a huge margin in terms of latency.
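For comparison, the pandas equivalent under the same assumed file and column names, with a simple timer around it:

import time

import pandas as pd

pdf = pd.read_csv("data.txt")

start = time.perf_counter()
# Same aggregations as the Spark version: sum, max, standard deviation per group.
result = pdf.groupby("category")["value"].agg(["sum", "max", "std"])
print(result)
print(f"pandas aggregation took {time.perf_counter() - start:.3f}s")

The same perf_counter pattern around agg.show() above gives a rough wall-clock comparison. Note that Spark evaluates lazily, so the timer must wrap the action (show or collect), not just the transformation.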
I was wondering what could be a possible reason for this. I have a couple of thoughts.
Thanks for looking. Much appreciated.
Because:

- Apache Spark is a complex framework designed to distribute processing across hundreds of nodes while ensuring correctness and fault tolerance. Each of these properties has a significant cost.
- Purely in-memory, in-core processing (pandas) is orders of magnitude faster than processing that goes through disk and network I/O (Spark), even when everything runs on one machine.
- Parallelism and distributed processing add significant overhead, and even an embarrassingly parallel workload does not guarantee any performance improvement.
- Local mode is not designed for performance; it is meant for testing.
- Last but not least, two cores running on 393 MB of data is not enough to see any performance improvement, and a single node provides no opportunity for distribution.

You can go on like this for a long time...
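One way to see some of that machinery firsthand is to print the physical plan Spark builds even for the trivial aggregation sketched earlier (agg is the hypothetical DataFrame from that snippet):

# Prints the physical plan: partial and final hash aggregates separated
# by a shuffle exchange - bookkeeping pandas simply never has to do.
agg.explain()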