
Azure - Trigger Databricks notebook for each new blob in Storage container

I am implementing a test solution as follows:

I have created an Azure Databricks notebook in Python. This notebook performs the following tasks (for testing):

  1. Read a blob file from the Storage account into a PySpark DataFrame.
  2. Perform some transformation and analysis on it.
  3. Create a CSV with the transformed data and store it in a different container.
  4. Move the original CSV to a separate archive container (so that it is not picked up in the next execution).

*The above steps can also be split across different notebooks. A minimal sketch of the flow is shown below.
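For illustration, a minimal PySpark sketch of steps 1-4 might look like the following. All container names and paths are hypothetical placeholders, and it assumes the containers are mounted under /mnt/ in the workspace:

```python
# Minimal sketch of the four steps above. All container names and paths
# are hypothetical; `spark` and `dbutils` are provided by the Databricks runtime.
from pyspark.sql import functions as F

input_path = "/mnt/input/data.csv"        # blob to process (placeholder)
output_path = "/mnt/output/transformed"   # container for transformed data
archive_path = "/mnt/archive/data.csv"    # archive container

# 1. Read the blob file into a PySpark DataFrame.
df = spark.read.option("header", "true").csv(input_path)

# 2. Do some transformation/analysis (here: just add a processing timestamp).
transformed = df.withColumn("processed_at", F.current_timestamp())

# 3. Write the transformed data as CSV into a different container.
transformed.write.mode("overwrite").option("header", "true").csv(output_path)

# 4. Move the original CSV to the archive container so the next run
#    does not pick it up again.
dbutils.fs.mv(input_path, archive_path)
```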

Now, I need this notebook to be triggered for each new blob in a container. I will implement the following orchestration:

New blob in container -> event to Event Grid topic -> trigger Data Factory pipeline -> execute Databricks notebook.

We can pass the filename as a parameter from the ADF pipeline to the Databricks notebook, as sketched below.
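On the notebook side this could be read with a widget; a minimal sketch, assuming a base parameter named file_name (a hypothetical name) is set on the ADF Notebook activity:

```python
# Sketch: read the filename passed as a base parameter by the ADF Notebook
# activity. "file_name" is a hypothetical parameter name; on the ADF side it
# could be bound to the trigger output, e.g. @triggerBody().fileName.
dbutils.widgets.text("file_name", "")          # declare the widget with a default
file_name = dbutils.widgets.get("file_name")   # value supplied by the pipeline

df = spark.read.option("header", "true").csv(f"/mnt/input/{file_name}")
```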

I am looking for other ways to implement the orchestration flow. If the above seems correct and most suitable, please mark this as answered.


You can use that method. Of course, you can also follow this path:

New blob in container -> use the built-in event trigger to trigger the Data Factory pipeline -> execute the Databricks notebook.

I don't think you need to introduce Event Grid, because Data Factory comes with built-in triggers for blob-created events.
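Such a storage event trigger can also be created programmatically; a minimal sketch using the azure-mgmt-datafactory SDK, where all resource names, the pipeline name, and the file_name parameter are hypothetical:

```python
# Minimal sketch: create an ADF storage event trigger with the Azure SDK
# (azure-mgmt-datafactory). All resource names below are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobEventsTrigger, PipelineReference, TriggerPipelineReference, TriggerResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = BlobEventsTrigger(
    events=["Microsoft.Storage.BlobCreated"],       # fire on new blobs
    blob_path_begins_with="/input/blobs/",          # watch the "input" container
    ignore_empty_blobs=True,
    scope="<storage-account-resource-id>",          # full ARM id of the account
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="RunDatabricksNotebook"),
        # Pass the blob name through to the pipeline (and on to the notebook).
        parameters={"file_name": "@triggerBody().fileName"},
    )],
)

client.triggers.create_or_update("myResourceGroup", "myDataFactory",
                                 "NewBlobTrigger", TriggerResource(properties=trigger))
```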

I got 2 supporting comments for the orchestration I am following: New blob in container -> event to Event Grid topic -> trigger Data Factory pipeline -> execute Databricks notebook.
