
How do I reduce the run-time for Big Data PySpark scripts?


I am currently working on a project in Databricks with approximately 6 GiB of data in a single table, so as you can imagine, operations on a table of this size are quite expensive. I would call myself an experienced coder, but I am still new to big data.
When working on smaller datasets I will often test parts of my code to see if they run properly, e.g.:

df[df['col1'] == 5]

However, with big data such as this, a filtering job can take many minutes to run.
Something I am noticing is that the run-time seems to increase as I continue my transformations in the notebook, even after massively reducing the size of the table.

Is there some kind of cache that needs to be emptied as I go along with coding in the notebook script? Or do I just have to live with long run-times when dealing with data of this size?
I don't want to start increasing the size of my compute cluster if I can reduce the run-time simply by improving my code.

I realize that this question is quite broad, but any tips or tricks would be greatly appreciated.

It's likely that in your col1 == 5 example Spark has to do a complete scan of every row in the table to find the (possibly single) row with a value of 5.

If you don't need a precise result for testing, you can use .limit(), which will efficiently take only the first rows the database happens to come across.
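
As a rough sketch (assuming a DataFrame named df is already loaded, as in your example):

# Grab an arbitrary handful of rows for a quick sanity check;
# Spark can stop scanning as soon as it has collected them.
preview_df = df.limit(100)
preview_df.show()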

Likewise, if you know there's only a single col1 == 5 row, df.filter(F.col('col1') == 5).limit(1) will tell Spark to stop searching the table once it finds your row, which is going to be a minor win most of the time.
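
A minimal, hypothetical version of that snippet with the usual functions import spelled out:

from pyspark.sql import functions as F

# Stop scanning as soon as one matching row has been found.
match_df = df.filter(F.col('col1') == 5).limit(1)
match_df.show()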

.sample() might also help with total runtime while still testing a meaningful subset of the table (put on your Central Limit Theorem hat).
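
For example (the fraction and seed here are arbitrary, assumed values):

# Test against a ~1% random sample of the table instead of the full 6 GiB.
# The fraction is approximate; Spark does not guarantee an exact row count.
subset_df = df.sample(fraction=0.01, seed=42)
subset_df.filter(subset_df['col1'] == 5).show()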

cache(), persist() and checkpoint() are also all useful, and are explained in this question: What is the difference between spark checkpoint and persist to a disk
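
A sketch of how caching might look in a notebook like yours (reduced_df and col2 are hypothetical names):

# Once the table has been filtered down, keep the small result in memory
# so later cells reuse it instead of recomputing the whole lineage.
reduced_df = df.filter(df['col1'] == 5).cache()
reduced_df.count()                          # an action materializes the cache
reduced_df.groupBy('col2').count().show()   # reuses the cached data
reduced_df.unpersist()                      # release the cache when finished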
