
Any benefits of using PySpark code over SQL in Azure Databricks?

Original post: 2023-01-12 07:04:52 · Tags: azure, databricks, azure-databricks

I am working on a project that already has SQL code in place. We are now migrating to Azure, so I created an Azure Databricks workspace for the transformation step and reused the same SQL code with some minor changes.

I want to know: is there a recommended way or best practice for working with Azure Databricks? Should we rewrite the code in PySpark for better performance?

Note: the results from the previous SQL code have no bugs. It's just that we are migrating to Azure. Instead of spending time rewriting the code, I reused the same SQL code. Now I am looking for suggestions to understand the best practices and what difference they would make.

Looking for your help. Thanks!

Expecting: along with the migration from on-prem to Azure, I am looking for some best practices for better performance.

3 answers

Under the hood, all of the code (SQL/Python/Scala, if written correctly) is executed by the same execution engine. You can always compare the execution plans of SQL and Python (EXPLAIN <query> for SQL, and dataframe.explain() for Python) and see that they are the same for the same operations.

So if your SQL code is working already, you may continue to use it:

You can trigger SQL queries/dashboards/alerts from Databricks Workflows
You can use SQL operations in Delta Live Tables (DLT)
You can use dbt together with Databricks Workflows

But often you can get more flexibility or functionality when using Python. For example (this is not a full list):

You can programmatically generate DLT tables that perform the same transformations but on different tables
You can use streaming sources (SQL support for streaming isn't very broad yet)
You can integrate your code with third-party libraries

But really, on Databricks you can usually mix and match SQL and Python code. For example, you can expose Python code as a user-defined function and call it from SQL (there is a small example of a DLT pipeline that does this), etc.

You asked a lot of questions there, but I'll address the one you asked in the title:

Any benefits of using PySpark code over SQL?

Yes.

PySpark is easier to test. For example, a transformation written in PySpark can be abstracted into a Python function, which can then be executed in isolation within a test, so you can use any of the many Python testing frameworks (personally I'm a fan of pytest). This isn't as easy with SQL, where a transformation exists within the confines of the entire SQL statement and can't be abstracted without using views or user-defined functions, which are physical database objects that need to be created.
PySpark is more composable. One can pull together custom logic from different places (perhaps written by different people) to define an end-to-end ETL process.
PySpark's lazy evaluation is a beautiful thing. It allows you to compose an ETL process in an exploratory fashion, making changes as you go. It really is what makes PySpark (and Spark in general) a great thing. The benefits of lazy evaluation can't really be explained; they have to be experienced.

Don't get me wrong, I love SQL, and for ad-hoc exploration it can't be beaten. There are good, justifiable reasons for using SQL over PySpark, but that wasn't your question.

These are just my opinions; others may beg to differ.

After getting help on the posted question and doing some research, I came up with the response below:

It does not matter which language you choose (SQL or Python). Both run on a Spark cluster, so Spark distributes the work across the cluster. Which one to use depends on the specific use case.
For both SQL and PySpark DataFrames, intermediate results are stored in memory.
In the same notebook we can use both languages, depending on the situation.

Use Python: for heavy transformations (more complex data processing) or for analytical / machine learning purposes.
Use SQL: when we are dealing with a relational data source (focused on querying and manipulating structured data stored in a relational database).

Note: there may be optimization techniques in both languages that we can use to improve performance.

Summary: choose the language based on the use case. Both get distributed processing because they run on a Spark cluster.

Thank you!