
What is the point in using PySpark over Pandas?

I've been learning Spark recently (PySpark to be more precise), and at first it seemed really useful and powerful to me. You can process gigabytes of data in parallel, so it should be much faster than processing it with classical tools... right? So I wanted to try it myself to be convinced.

So I downloaded a CSV file of almost 1 GB, ~ten million rows (link: https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-01.csv.gz ), and wanted to process it with both Spark and Pandas to see the difference.

So the goal was just to read the file and count how many rows there were for a certain date. I tried with PySpark:

Preprocess with PySpark

and with pandas:

Preprocess with Pandas
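Again the original code is not shown; an equivalent pandas sketch, under the same assumed column name and date, could be:

```python
import pandas as pd

def count_trips_on(path, date):
    """Count rows whose pickup_datetime falls on the given date (YYYY-MM-DD)."""
    df = pd.read_csv(path)
    pickup = pd.to_datetime(df["pickup_datetime"])
    # compare the date part only, ignoring the time of day
    return int((pickup.dt.date == pd.to_datetime(date).date()).sum())

# e.g. count_trips_on("fhvhv_tripdata_2021-01.csv", "2021-01-15")
```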

Which obviously gives the same result, but it takes about 1min30s for PySpark and only about 30s for Pandas.

I feel like I missed something, but I don't know what. Why does it take so much more time with PySpark? Shouldn't it be the contrary?

EDIT: I did not show my Spark configuration, but I am just using it locally, so maybe that is the explanation?

Spark is a distributed processing framework. That means that, in order to use it at its full potential, you must deploy it on a cluster of machines (called nodes): the processing is then parallelized and distributed across them. This usually happens on cloud platforms like Google Cloud or AWS. Another interesting option to check out is Databricks.

If you use it on your local machine, it runs on a single node, so it is effectively just a worse version of Pandas: you pay Spark's coordination and serialization overhead without gaining any parallelism across machines. That's fine for learning purposes, but it's not the way it is meant to be used.

For more information about how a Spark cluster works, check the documentation: https://spark.apache.org/docs/latest/cluster-overview.html Keep in mind that this is a very deep topic, and it would take a while to properly understand everything...

