
Combine batch data into Delta format in a data lake using Synapse and PySpark?

I currently have several daily interval tables in the bronze layer of a data lake. They are in CSV format, and new daily CSV tables are regularly ingested into the bronze folder.

I would like to transform them (e.g. editing some rows, changing column names) and save them in Delta format in the silver layer. What would be the best practice when using Synapse Analytics and PySpark? I have used a Synapse notebook so far for the transformations, but with my limited PySpark knowledge I could only cleanse the data and save each daily table in Delta format. What would the code look like for filtering the relevant data file names in a folder, taking only the newest tables, and combining them into one Delta table in the silver layer folder? In my search I only found streaming data and Auto Loader for Databricks, but I don't think Synapse offers Auto Loader functionality yet. Or how else would a Delta lake handle this kind of scenario?

Thanks and best regards!

I am doing something similar. I ingest CSV files for different entities into the Delta Lake bronze layer, with a separate Delta table for each entity. In the bronze layer I add audit columns to capture the file name, file load date, etc. Then I use the latest file load date to load data from bronze to silver. Hope this helps.
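A minimal PySpark sketch of that pattern, assuming a hypothetical "sales" entity and made-up ABFSS paths (the paths, the `header` option, and the column names `old_name`/`new_name` are placeholders to adapt to your lake):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, input_file_name

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths -- replace with your own storage account and containers.
csv_landing_path = "abfss://lake@mystorage.dfs.core.windows.net/bronze/sales/*.csv"
bronze_delta_path = "abfss://lake@mystorage.dfs.core.windows.net/bronze/sales_delta"
silver_delta_path = "abfss://lake@mystorage.dfs.core.windows.net/silver/sales"

# 1) Ingest the daily CSVs into a bronze Delta table, stamping each row
#    with audit columns: the source file name and a load timestamp.
bronze_df = (
    spark.read.option("header", "true").csv(csv_landing_path)
    .withColumn("source_filename", input_file_name())
    .withColumn("file_load_date", current_timestamp())
)
bronze_df.write.format("delta").mode("append").save(bronze_delta_path)

# 2) Promote only the rows from the most recent load into the silver table,
#    applying transformations (renames, row edits) along the way.
bronze_all = spark.read.format("delta").load(bronze_delta_path)
latest_load = bronze_all.agg({"file_load_date": "max"}).collect()[0][0]

silver_df = (
    bronze_all
    .filter(col("file_load_date") == latest_load)
    .withColumnRenamed("old_name", "new_name")  # hypothetical rename
)
silver_df.write.format("delta").mode("append").save(silver_delta_path)
```

If re-runs need to be idempotent, a Delta MERGE keyed on a business key (or a filter on `source_filename` against files already present in silver) would avoid loading the same file twice; the append shown here assumes each run only sees new files.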
