
Any benefits of using PySpark code over SQL in Azure Databricks?

I am working on a project where I already have SQL code in place. Now we are migrating to Azure, so I created an Azure Databricks workspace for the transformation piece and reused the same SQL code with some minor changes.

I want to know: is there any recommended way or best practice for working with Azure Databricks? Should we rewrite the code in PySpark for better performance?

Note: the end results from the previous SQL code have no bugs; we are simply migrating to Azure. Instead of spending time rewriting the code, I reused the same SQL code. Now I am looking for suggestions to understand the best practices and how they would make a difference.

Looking for your help. Thanks!

Expecting: along with the migration from on-prem to Azure, I am looking for some best practices for better performance.

Under the hood, all of the code (SQL/Python/Scala, if written correctly) is executed by the same execution engine. You can always compare the execution plans of SQL & Python (EXPLAIN <query> for SQL, and dataframe.explain() for Python) and see that they are the same for the same operations.
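
For example, here is a minimal sketch of that comparison in a Databricks notebook (where the spark session is pre-defined); the table and column names are placeholders, not real objects:

    # Compare the plans produced for the same aggregation expressed in SQL
    # and in the DataFrame API. "sample_table" / "category" are placeholders.
    from pyspark.sql import functions as F

    # SQL version: EXPLAIN returns the plan as a one-column result
    spark.sql("""
        EXPLAIN
        SELECT category, COUNT(*) AS cnt
        FROM sample_table
        GROUP BY category
    """).show(truncate=False)

    # PySpark DataFrame version of the same aggregation
    df = spark.table("sample_table")
    df.groupBy("category").agg(F.count("*").alias("cnt")).explain()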

So if your SQL code is already working, you may continue to use it.

But often you can get more flexibility or functionality when using Python. For example (this is not a full list):

  • You can programmatically generate DLT tables that perform the same transformations on different source tables (see the sketch after this list)
  • You can use streaming sources (SQL support for streaming isn't very broad yet)
  • You may need to integrate your code with some 3rd-party libraries
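
As a minimal sketch of the first point, this is roughly what programmatic generation of DLT tables can look like; the source table names and the cleaning logic are made-up placeholders, not a definitive implementation:

    # Hypothetical example: generate one DLT table per source table,
    # applying the same transformation to each. Table names are placeholders.
    import dlt
    from pyspark.sql import functions as F

    source_tables = ["raw.orders", "raw.customers", "raw.products"]

    def make_cleaned_table(source_name):
        cleaned_name = "cleaned_" + source_name.split(".")[-1]

        @dlt.table(name=cleaned_name, comment=f"Cleaned copy of {source_name}")
        def _cleaned():
            # Same transformation applied to every source table
            return (
                spark.read.table(source_name)
                .dropDuplicates()
                .withColumn("ingested_at", F.current_timestamp())
            )

    for t in source_tables:
        make_cleaned_table(t)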

But really, on Databricks you can usually mix and match SQL and Python code. For example, you can expose Python code as a user-defined function and call it from SQL (there is a small example of a DLT pipeline doing exactly that), etc.
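
A minimal sketch of that pattern (the function name "normalize_name" and the table "sample_table" are illustrative only):

    # Register a Python function as a SQL UDF, then call it from SQL.
    from pyspark.sql.types import StringType

    def normalize_name(s):
        return s.strip().title() if s is not None else None

    spark.udf.register("normalize_name", normalize_name, StringType())

    spark.sql("""
        SELECT normalize_name(customer_name) AS customer_name
        FROM sample_table
    """).show()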

You asked a lot of questions there, but I'll address the one you asked in the title:

Any benefits of using PySpark code over SQL?

Yes.

  • PySpark is easier to test. For example, a transformation written in PySpark can be abstracted into a Python function which can then be executed in isolation within a test, so you can employ one of the myriad of Python testing frameworks (personally I'm a fan of pytest); see the sketch after this list. This isn't as easy with SQL, where a transformation exists within the confines of the entire SQL statement and can't be abstracted without using views or user-defined functions, which are physical database objects that need to be created.
  • PySpark is more composable. One can pull together custom logic from different places (perhaps written by different people) to define an end-to-end ETL process.
  • PySpark's lazy evaluation is a beautiful thing. It allows you to compose an ETL process in an exploratory fashion, making changes as you go. It really is what makes PySpark (and Spark in general) great, and the benefits of lazy evaluation can't really be explained; they have to be experienced.
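
As a minimal sketch of the testability point, assuming a made-up transformation and made-up test data:

    # transformations.py -- a PySpark transformation abstracted into a plain function
    from pyspark.sql import DataFrame, functions as F

    def add_total_price(df: DataFrame) -> DataFrame:
        """Add a total_price column computed as quantity * unit_price."""
        return df.withColumn("total_price", F.col("quantity") * F.col("unit_price"))

    # test_transformations.py -- run with pytest
    # from transformations import add_total_price
    from pyspark.sql import SparkSession

    def test_add_total_price():
        spark = SparkSession.builder.master("local[1]").getOrCreate()
        input_df = spark.createDataFrame(
            [(2, 5.0), (3, 1.5)], ["quantity", "unit_price"]
        )
        result = add_total_price(input_df).collect()
        assert [row["total_price"] for row in result] == [10.0, 4.5]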

Don't get me wrong: I love SQL, and for ad-hoc exploration it can't be beaten. There are good, justifiable reasons for using SQL over PySpark, but that wasn't your question.

These are just my opinions, others may beg to differ.

After getting help on the posted question and doing some research, I came up with the response below:

  • It does not matter much which language you choose (SQL or Python): both run on the Spark cluster, so Spark distributes the work across the cluster either way. Which one to use depends on the specific use case.
  • Intermediate results are handled the same way for both SQL and PySpark DataFrames (and can be cached in memory in either case).
  • In the same notebook we can use both languages depending on the situation (see the sketch below).
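
A minimal sketch of mixing the two in one notebook (the view name "events" is just a placeholder; in a Databricks notebook the SQL part could also live in a separate %sql cell):

    from pyspark.sql import functions as F

    df = spark.range(100).withColumn("bucket", F.col("id") % 10)

    # Expose the DataFrame to SQL...
    df.createOrReplaceTempView("events")

    # ...and query it with SQL from the same notebook.
    spark.sql("SELECT bucket, COUNT(*) AS cnt FROM events GROUP BY bucket").show()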

  • Use Python - for heavy transformations (more complex data processing) or for analytical / machine learning purposes.
  • Use SQL - when dealing with a relational data source (focused on querying and manipulating structured data stored in a relational database).

Note: there are optimization techniques in both languages (e.g. caching, join hints) that we can use to improve performance; see the sketch below.
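
For instance, a minimal sketch of the same broadcast-join hint expressed both ways (the table and column names "facts", "dims", "dim_id", "dim_name" are placeholders):

    from pyspark.sql import functions as F

    # SQL hint
    spark.sql("""
        SELECT /*+ BROADCAST(d) */ f.*, d.dim_name
        FROM facts f JOIN dims d ON f.dim_id = d.id
    """).explain()

    # Equivalent DataFrame API hint
    f = spark.table("facts")
    d = spark.table("dims")
    f.join(F.broadcast(d), f["dim_id"] == d["id"]).explain()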

Summary: choose the language based on the use case. Both have distributed processing because they run on the Spark cluster.

Thank you!
