
Get data from API using Python and load into Azure SQL Data Warehouse using Azure Data Factory

I want to create a data warehouse in Azure that contains information from several sources. The input data comes from different APIs, which I want to access using Python, and the output should be stored in the warehouse. This process should run every day.

I have read a lot of Azure documentation, but I still don't understand how to design this process.

The first question is: where should the Python processes that collect data from the different APIs live? In an Azure Data Factory pipeline, or somewhere else?

Regards

With Azure Data Factory, you would connect to the sources using the built-in connectors: https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-overview

By using the V2 service in ADF, you would be able to schedule the pipeline to trigger daily at your desired times.

With Python you can use the API to create, configure, and schedule the Data Factory pipelines. Data Factory itself won't run any of your Python code; it is configured purely with JSON definitions. The Python library only helps you create those JSON definitions in a language you are familiar with, and the same goes for .NET, PowerShell, and every other supported language. The end result is always a bunch of JSON files.

I don't know the specifics of your case, but in general you need to create linked services, datasets (which use those linked services), and pipelines, which are groups of logical activities (which in turn use those datasets).
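To make that hierarchy concrete, here is a minimal sketch using the azure-mgmt-datafactory SDK (with azure-identity for authentication). Every name, ID, and connection string below is a placeholder, and the exact model constructors vary a little between SDK versions, so treat this as an illustration rather than a ready-made implementation:

```python
# Minimal sketch: linked service -> dataset, created through the ADF Python SDK.
# All names, IDs and connection strings are placeholders.
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureSqlDWLinkedService, AzureSqlDWTableDataset, DatasetResource,
    LinkedServiceReference, LinkedServiceResource, SecureString,
)

credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<client-secret>"
)
adf = DataFactoryManagementClient(credential, "<subscription-id>")
rg, factory = "my-resource-group", "my-data-factory"

# Linked service: the connection to the Azure SQL Data Warehouse.
adf.linked_services.create_or_update(rg, factory, "SqlDwLinkedService", LinkedServiceResource(
    properties=AzureSqlDWLinkedService(
        connection_string=SecureString(value="<sql-dw-connection-string>"))))

# Dataset: a table in the warehouse, defined on top of that linked service.
adf.datasets.create_or_update(rg, factory, "SqlDwDataset", DatasetResource(
    properties=AzureSqlDWTableDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="SqlDwLinkedService"),
        table_name="dbo.ApiData")))

# A pipeline is then just a named group of activities (e.g. a copy activity)
# referencing these datasets; each create_or_update call above simply produces
# the corresponding JSON definition inside the factory.
```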

If you are using ADFv1, you can configure the schedule within the dataset's properties, and you won't need a gateway because you are not using on-premises data. If you are using ADFv2, you will need an Azure Integration Runtime (type "managed"), and you configure the schedule with triggers.
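As a rough sketch of the ADFv2 approach, a daily schedule trigger created through the same Python SDK could look like the following. The names are placeholders, it assumes a pipeline called "DailyApiLoad" already exists in the factory, and the trigger-start method name differs between SDK versions:

```python
# Sketch of an ADFv2 schedule trigger that runs a pipeline once a day.
from datetime import datetime, timezone

from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource,
)

credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<client-secret>"
)
adf = DataFactoryManagementClient(credential, "<subscription-id>")
rg, factory = "my-resource-group", "my-data-factory"

# Recur once per day, starting at 02:00 UTC.
recurrence = ScheduleTriggerRecurrence(
    frequency="Day", interval=1,
    start_time=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
    time_zone="UTC",
)
trigger = TriggerResource(properties=ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="DailyApiLoad"),
        parameters={},
    )],
))
adf.triggers.create_or_update(rg, factory, "DailyTrigger", trigger)
adf.triggers.begin_start(rg, factory, "DailyTrigger")  # .start(...) on older SDK versions
```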

I hope I was able to clarify these concepts a bit.

Cheers.

You have two options:

  1. Throw away your Python code and define an HTTP Connector to describe your data movement. You will probably also need a subsequent transformation activity for the "Transform" step of your ETL.
  2. Embed your Python code in a custom activity run by Azure Batch. This is a harder and more error-prone solution.

In your position, I'd go with the HTTP Connector.
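For illustration, here is a hedged sketch of what option 1 could look like through the Python SDK: an HTTP linked service and dataset as the source, copied into the "SqlDwDataset" from the earlier sketch. The API URL, authentication type, and all names are placeholders, and model signatures differ between SDK versions:

```python
# Sketch of option 1: HTTP connector as source, SQL Data Warehouse as sink.
# All names and URLs are placeholders.
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, DatasetResource, HttpDataset,
    HttpLinkedService, HttpSource, LinkedServiceReference,
    LinkedServiceResource, PipelineResource, SqlDWSink,
)

credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<client-secret>"
)
adf = DataFactoryManagementClient(credential, "<subscription-id>")
rg, factory = "my-resource-group", "my-data-factory"

# Linked service + dataset describing the source API (assumed anonymous here).
adf.linked_services.create_or_update(rg, factory, "ApiLinkedService", LinkedServiceResource(
    properties=HttpLinkedService(url="https://api.example.com",
                                 authentication_type="Anonymous")))
adf.datasets.create_or_update(rg, factory, "ApiDataset", DatasetResource(
    properties=HttpDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="ApiLinkedService"),
        relative_url="/v1/data", request_method="GET")))

# Copy activity: HTTP source -> SQL DW sink ("SqlDwDataset" as sketched above).
copy = CopyActivity(
    name="CopyApiToDw",
    inputs=[DatasetReference(type="DatasetReference", reference_name="ApiDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="SqlDwDataset")],
    source=HttpSource(),
    sink=SqlDWSink(),
)
adf.pipelines.create_or_update(rg, factory, "DailyApiLoad", PipelineResource(activities=[copy]))
```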
