
How do I reduce the run-time for Big Data PySpark scripts?


I am currently working on a project in Databricks with approximately 6 GiB of data in a single table, so as you can imagine, operations on a table of this size are quite expensive. I would call myself an experienced coder, but I am still new to big data.
When working on smaller datasets I will often test parts of my code to see if they run properly, e.g.:

df[df['col1'] == 5]

However, with big data such as this, a filtering job can take many minutes to run.
Something I am noticing is that the run-time seems to increase as I continue my transformations in the notebook, even after massively reducing the size of the table.

Is there some kind of cache that needs to be emptied as I go along with coding in the notebook script? Or do I just have to live with long run-times when dealing with data of this size?
I don't want to start increasing the size of my compute cluster if I can reduce the run-time simply by improving my code.

I realize that this question is quite broad, but any tips or tricks would be greatly appreciated.

It's likely that in your col1 == 5 example Spark has to do a complete scan of every row in the table to find the (possibly single) row with a value of 5.

If you don't need a precise result for testing, you can use .limit(), which will efficiently take only the first rows the database happens to come across.
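
As a rough sketch (assuming a DataFrame named df is already loaded, as in your example):

# Grab an arbitrary handful of rows for a quick sanity check;
# Spark can stop scanning as soon as it has collected them.
preview_df = df.limit(100)
preview_df.show()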

Likewise, if you know there's only a single col1 == 5 row, df.filter(F.col('col1') == 5).limit(1) will tell Spark to stop searching the table once it finds your row, which is going to be a minor win most of the time.
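
A minimal, hypothetical version of that snippet with the usual functions import spelled out:

from pyspark.sql import functions as F

# Stop scanning as soon as one matching row has been found.
match_df = df.filter(F.col('col1') == 5).limit(1)
match_df.show()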

.sample() might also help with total runtime while still testing a meaningful subset of the table (put on your Central Limit Theorem hat).
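
For example (the fraction and seed here are arbitrary, assumed values):

# Test against a ~1% random sample of the table instead of the full 6 GiB.
# The fraction is approximate; Spark does not guarantee an exact row count.
subset_df = df.sample(fraction=0.01, seed=42)
subset_df.filter(subset_df['col1'] == 5).show()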

cache(), persist() and checkpoint() are also all useful, and are explained in this question: What is the difference between spark checkpoint and persist to a disk
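
A sketch of how caching might look in a notebook like yours (reduced_df and col2 are hypothetical names):

# Once the table has been filtered down, keep the small result in memory
# so later cells reuse it instead of recomputing the whole lineage.
reduced_df = df.filter(df['col1'] == 5).cache()
reduced_df.count()                          # an action materializes the cache
reduced_df.groupBy('col2').count().show()   # reuses the cached data
reduced_df.unpersist()                      # release the cache when finished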
