I'm getting continuous blob files in blob storage. I have to load them in Databricks and put them into Azure SQL DB, with Data Factory orchestrating the pipeline.
I receive data continuously in blob storage. Initially I had 5 blob files in the storage account, and I was able to load them from blob into Azure SQL DB using Databricks, automating the process with Data Factory. The problem is that when newer files arrive in blob storage, Databricks loads them along with the older files and sends everything into Azure SQL DB.

I don't want these old files; on each run I want only the newer ones, so that the same data is not loaded into Azure SQL DB again and again.
The easiest way to do that is simply to archive the files you have just read into a new folder; call it archiveFolder. Say your Databricks job is reading from the following directory:
mnt
└── sourceFolder
    ├── file1.txt
    ├── file2.txt
    └── file3.txt
You run your code, ingest the files, and load them into SQL Server. Then you can simply archive these files (move them from sourceFolder into archiveFolder). In Databricks this can be done with the following command:
dbutils.fs.mv(sourcefilePath, archiveFilePath, True)
So, the next time your code runs, you will only have the new files in your sourceFolder.