Any benefits of using PySpark code over SQL in Azure Databricks?
I am working on something where I have SQL code in place already. Now we are migrating to Azure, so I created an Azure Databricks workspace for the transformation piece and used the same SQL code with some minor changes.
I want to know - Is there any recommended way or best practice to work with Azure Databricks? Should we rewrite the code in PySpark for better performance?
Note: The end results from the previous SQL code have no bugs. It's just that we are migrating to Azure. Instead of spending time rewriting the code, I made use of the same SQL code. Now I am looking for suggestions to understand the best practices and how it will make a difference.
Looking for your help. Thanks!
Expecting - Along with the migration from on-prem to Azure, I am looking for some best practices for better performance.
Under the hood, all of the code (SQL/Python/Scala, if written correctly) is executed by the same execution engine. You can always compare the execution plans of SQL & Python (EXPLAIN <query> for SQL, and dataframe.explain() for Python) and see that they are the same for the same operations.
So if your SQL code is working already, you may continue to use it.
But often you can get more flexibility or functionality when using Python.
But really, on Databricks you can usually mix & match SQL & Python code together; for example, you can expose Python code as a user-defined function and call it from SQL (there is a small example of a DLT pipeline doing that), etc.
You asked a lot of questions there, but I'll address the one you asked in the title:
Any benefits of using PySpark code over SQL?
Yes.
Don't get me wrong, I love SQL, and for ad-hoc exploration it can't be beaten. There are good, justifiable reasons for using SQL over PySpark, but that wasn't your question.
These are just my opinions; others may beg to differ.
After getting help on the posted question and doing some research, I came up with the response below --
Use Python - for heavy transformations (more complex data processing) or for analytical / machine learning purposes.
Use SQL - when dealing with a relational data source (focused on querying and manipulating structured data stored in a relational database).
Note: There may be optimization techniques in both languages which we can use to make the performance better.
Summary: Choose the language based on the use case. Both have distributed processing because they run on a Spark cluster.
Thank you!