I'm getting continuous blob files in blob storage. I have to load them in Databricks and put them into Azure SQL DB, with Data Factory orchestrating the pipeline.
I receive data continuously in blob storage. Initially I had 5 blob files in the storage account, and I was able to load them from blob into Azure SQL DB using Databricks, automating the process with Data Factory. The problem is that when newer files arrive in blob storage, Databricks loads them along with the older files and sends everything into Azure SQL DB.

I don't want these old files; on each run I want only the newer ones, so that the same data is not loaded into Azure SQL DB again and again.
The easiest way to do that is simply to archive the files you have just read into a new folder; call it archiveFolder. Say your Databricks job is reading from the following directory:
mnt
└── sourceFolder
    ├── file1.txt
    ├── file2.txt
    └── file3.txt
You run your code, ingest the files, and load them into SQL Server. Then you can simply archive these files (move them from sourceFolder into archiveFolder). In Databricks this can be done with the following command:
dbutils.fs.mv(sourcefilePath, archiveFilePath, True)
So, the next time your code runs, you will only have the new files in your sourceFolder.