I receive data continuously into blob storage. Initially there were 5 blob files, which I was able to load from blob storage into Azure SQL DB using Databricks, and I automated the pipeline with Data Factory. The problem is that when newer files arrive in blob storage, Databricks loads them along with the older files and sends everything to Azure SQL DB. I don't want the old files reprocessed; each run should pick up only the new files, so the same data is not loaded into Azure SQL DB again and again.
The easiest way to do that is to archive each file right after you read it, moving it into a new folder (call it archiveFolder). Say your Databricks job reads from the following directory:
mnt
  sourceFolder
    file1.txt
    file2.txt
    file3.txt
You run your code, ingest the files, and load them into Azure SQL DB. Then you simply archive those files (move them from sourceFolder into archiveFolder). In Databricks this can be done with the following command:
dbutils.fs.mv(sourcefilePath, archiveFilePath, True)
So the next time your code runs, only the new files will be present in sourceFolder.
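To make the pattern concrete, here is a minimal sketch of the ingest-then-archive loop. It uses Python's `pathlib` and `shutil` on a local filesystem as a stand-in for the mounted blob paths; on Databricks you would replace the `shutil.move` call with `dbutils.fs.mv` as shown above. The folder names come from the example; the function name and the structure of the loop are illustrative assumptions, not a fixed API.

```python
import pathlib
import shutil


def archive_ingested_files(source_dir: str, archive_dir: str) -> list:
    """Move every file out of source_dir into archive_dir so that the
    next run sees only files that arrived after this one.

    Local-filesystem stand-in: on Databricks you would call
    dbutils.fs.mv(source_path, archive_path, True) instead of shutil.move.
    """
    source = pathlib.Path(source_dir)
    archive = pathlib.Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)  # create archiveFolder if missing

    moved = []
    for f in sorted(source.iterdir()):
        if f.is_file():
            # (Load f into Azure SQL DB here, then archive it.)
            shutil.move(str(f), str(archive / f.name))
            moved.append(f.name)
    return moved
```

Because each file is moved immediately after it is loaded, a rerun of the job never sees already-processed files, which is exactly the deduplication behavior the question asks for.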