
Combine batch data into Delta format in a data lake using Synapse and PySpark?

I currently have several daily interval tables in the bronze layer of a data lake. They are in CSV format, and new daily CSV tables are regularly ingested into the bronze folder.

I would like to transform them (e.g. editing some rows, changing column names) and save them in Delta format in the silver layer. What would be the best practice when using Synapse Analytics and PySpark? I have used a Synapse notebook so far for the transformations, but with my limited PySpark knowledge I could only cleanse the data and save each daily table in Delta format. What would the code look like for filtering the relevant data file names in a folder, taking only the newest tables, and combining them into one Delta table in the silver layer folder? In my search I only found streaming data and Auto Loader for Databricks, but I don't think Synapse offers Auto Loader functionality yet. Or how else would a Delta lake handle this kind of scenario?

Thanks and best regards!

I am doing something similar. I ingest CSV files for different entities into the Delta Lake bronze layer, with a separate Delta table for each entity. In the bronze layer I add audit columns to capture the file name, file load date, etc. Then I use the latest file load date to load data from bronze to silver. Hope this helps.
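A minimal PySpark sketch of that pattern, assuming a hypothetical "sales" entity and made-up ABFSS paths (the paths, the `header` option, and the column names `old_name`/`new_name` are placeholders to adapt to your lake):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, input_file_name

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths -- replace with your own storage account and containers.
csv_landing_path = "abfss://lake@mystorage.dfs.core.windows.net/bronze/sales/*.csv"
bronze_delta_path = "abfss://lake@mystorage.dfs.core.windows.net/bronze/sales_delta"
silver_delta_path = "abfss://lake@mystorage.dfs.core.windows.net/silver/sales"

# 1) Ingest the daily CSVs into a bronze Delta table, stamping each row
#    with audit columns: the source file name and a load timestamp.
bronze_df = (
    spark.read.option("header", "true").csv(csv_landing_path)
    .withColumn("source_filename", input_file_name())
    .withColumn("file_load_date", current_timestamp())
)
bronze_df.write.format("delta").mode("append").save(bronze_delta_path)

# 2) Promote only the rows from the most recent load into the silver table,
#    applying transformations (renames, row edits) along the way.
bronze_all = spark.read.format("delta").load(bronze_delta_path)
latest_load = bronze_all.agg({"file_load_date": "max"}).collect()[0][0]

silver_df = (
    bronze_all
    .filter(col("file_load_date") == latest_load)
    .withColumnRenamed("old_name", "new_name")  # hypothetical rename
)
silver_df.write.format("delta").mode("append").save(silver_delta_path)
```

If re-runs need to be idempotent, a Delta MERGE keyed on a business key (or a filter on `source_filename` against files already present in silver) would avoid loading the same file twice; the append shown here assumes each run only sees new files.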
