
I'm getting blob files continuously in blob storage. I have to load them with Databricks and put them into Azure SQL DB, with Data Factory orchestrating the pipeline

I receive data continuously in blob storage. Initially there were 5 blob files in the storage, and I was able to load them from blob to Azure SQL DB using Databricks and automate it using Data Factory. The problem is that when newer files arrive in blob storage, Databricks loads them along with the older files and sends everything to Azure SQL DB. I don't want the old files; each time I want only the new ones, so that the same data is not loaded again and again into Azure SQL DB.

The easiest way to do that is to archive each file right after you read it, moving it into a new folder; call it archiveFolder. Say your Databricks job reads from the following directory:

mnt
  sourceFolder
    file1.txt
    file2.txt
    file3.txt

You run your code, ingest the files, and load them into SQL Server. Then you can simply archive these files (move them from sourceFolder into archiveFolder). In Databricks this can be done with the following command:

dbutils.fs.mv(sourceFilePath, archiveFilePath, True)

So the next time your code runs, only the new files will be in your sourceFolder.
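For reference, here is a minimal end-to-end sketch of this pattern: read the current batch, write it to Azure SQL DB, then archive the ingested files. The mount paths, JDBC connection details, and table name are assumptions for illustration; adjust them to your setup.

source_dir = "dbfs:/mnt/sourceFolder"
archive_dir = "dbfs:/mnt/archiveFolder"

# Read every file currently sitting in the source folder.
df = spark.read.text(source_dir)

# Write the batch to Azure SQL DB over JDBC (placeholder connection details).
(df.write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
   .option("dbtable", "dbo.MyTable")  # hypothetical target table
   .option("user", "<user>")
   .option("password", "<password>")
   .mode("append")
   .save())

# Archive each ingested file so the next run only sees new arrivals.
for f in dbutils.fs.ls(source_dir):
    dbutils.fs.mv(f.path, archive_dir + "/" + f.name, True)

Because Data Factory triggers the notebook on a schedule, each run then processes only the files that arrived since the previous run, and nothing is loaded twice.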
