
Dynamic REST calls in Azure Synapse Pipeline

I am making a call to a REST API with Azure Synapse, and the returned dataset looks something like this:

{
"links": [
    {
        "rel": "next",
        "href": "[myRESTendpoint]?limit=1000&offset=1000"
    },
    {
        "rel": "last",
        "href": "[myRESTendpoint]?limit=1000&offset=60000"
    },
    {
        "rel": "self",
        "href": "[myRESTendpoint]"
    }
],
"count": 1000,
"hasMore": true,
"items": [
    {
        "links": [],
        "closedate": "6/16/2014",
        "id": "16917",
        "number": "62000",
        "status": "H",
        "tranid": "0062000"
    },...
],
"offset": 0,
"totalResults": 60316
}

I am familiar with using a Synapse pipeline to call a REST endpoint that returns all of its data in a single call. This particular endpoint, however, has a hard limit of 1000 records per call, although it does return a property named "hasMore".

Is there a way to recursively make REST calls in a Synapse pipeline until the "hasMore" property equals false?

The end goal of this is to sink data to either a dedicated SQL pool or into ADLS2 and transform from there.

I tried the same scenario using Azure Data Factory, which seems more appropriate and makes it easier to achieve the stated goal of sinking the data to either a dedicated SQL pool or ADLS2 and transforming from there.

Since you have to request the pages one after another, 1000 records at a time, you can use the pagination support of the REST connector if the response headers or response body contain the URL for the next page.

You're less likely to be able to use this functionality if the next-page link or query parameter isn't included in the response headers or body.
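To make the idea concrete outside the pipeline designer, here is a minimal Python sketch of what "the response body contains the URL for the next page" means for the payload shown in the question. The helper name `next_page_url` is hypothetical, and this is not how the REST connector is implemented; it only illustrates the lookup that a pagination setting would have to express.

```python
from typing import Optional


def next_page_url(response_body: dict) -> Optional[str]:
    """Return the href of the link whose rel is "next", or None if absent.

    Hypothetical helper: it only mirrors the shape of the sample response
    from the question, not any connector internals.
    """
    for link in response_body.get("links", []):
        if link.get("rel") == "next":
            return link.get("href")
    return None


# Trimmed copy of the sample response from the question.
sample = {
    "links": [
        {"rel": "next", "href": "[myRESTendpoint]?limit=1000&offset=1000"},
        {"rel": "last", "href": "[myRESTendpoint]?limit=1000&offset=60000"},
        {"rel": "self", "href": "[myRESTendpoint]"},
    ],
    "hasMore": True,
}

print(next_page_url(sample))
```

When no link with `rel` equal to "next" is present, the helper returns `None`, which corresponds to the case above where built-in pagination has nothing to follow.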

Alternatively, you can use loop logic around the Copy activity.

Create two parameters in the REST connector, then fill in those parameters in the REST connector's relative URL.

Using the Set Variable activity, the value of a variable can be increased on each pass through a loop. For each cycle, the URL for the Copy activity is set dynamically. To loop or iterate, you can use the Until activity.
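The loop described above can be sketched in plain Python to show the control flow: run the copy step, check "hasMore", and increase the offset before the next pass. This is only an illustration under assumptions. `fetch_all` and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for the real REST call so the sketch runs without a network.

```python
from typing import Callable, Iterator

PAGE_SIZE = 1000  # the API's hard limit per call, per the question


def fetch_all(fetch_page: Callable[[int, int], dict],
              page_size: int = PAGE_SIZE) -> Iterator[dict]:
    """Yield every item, requesting pages until hasMore is false."""
    offset = 0  # plays the role of the pipeline variable
    while True:
        page = fetch_page(page_size, offset)   # the Copy activity step
        yield from page.get("items", [])
        if not page.get("hasMore", False):     # the Until condition
            break
        offset += page_size                    # the Set Variable step


def fake_fetch_page(limit: int, offset: int, total: int = 2500) -> dict:
    """Stand-in for the real endpoint (assumption: 2500 rows in total)."""
    end = min(offset + limit, total)
    return {
        "items": [{"id": str(i)} for i in range(offset, end)],
        "hasMore": end < total,
        "offset": offset,
    }


rows = list(fetch_all(fake_fetch_page))
print(len(rows))
```

The body runs once before the condition is checked, so the first page is always requested even when the result set is small.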

Alternative:

In my experience, the REST connector's pagination is quite rigid. I usually put the activity inside a loop instead, to have more control, for example a ForEach loop.

For those following the thread, I used IpsitaDash-MT's suggestion of a ForEach loop. With this API, each call returns a property named "totalResults" at the end of the response. Here are the steps I used to achieve what I was looking to do:

  1. Make a dummy call to the API to get the "totalResults" property. This call only serves to return the number of results I am looking to get. With this API, the body of the request is a SQL statement, so in the dummy request I ask only for the IDs of the results.

(Image: SQL statement example)

  2. I then take the "totalResults" property from that request and set a dynamic value in the "Items" of the ForEach loop like this:

    @range(0,add(div(sub(int(activity('Get Pages Customers').output.totalResults),mod(int(activity('Get Pages Customers').output.totalResults),1000)),1000),1))

NOTE: The API only allows pages of 1000 results, so I do some math to get a range of page numbers. I also have to add 1 to the final result to include the last page.

(Image: ForEach loop settings)

  3. The API accepts two parameters, "limit" and "offset". Since I want all of the data, there is no reason to set limit to anything other than 1000 (the maximum allowed). The offset parameter can be set to any number less than or equal to "totalResults" - "limit" and greater than or equal to 0. So I take each page number from the range established in step 2 and multiply it by 1000 to set the offset parameter in the URL.

(Image: Setting the offset parameter in the copy data activity)

(Image: Dynamic value of the Relative URL in the REST connector)

NOTE: I found it better to sink the data as JSON into ADLS2 first rather than into a dedicated SQL pool, because of the Lookup feature.

  4. Since Synapse does not allow nested ForEach loops, I run the data through a data flow to format it and check for duplicates and updates.

  5. When the data flow completes, it kicks off a Lookup activity that gets the data just processed and passes it into a new pipeline, which uses another ForEach loop to get the child data for each parent ID.

(Image: Data Flow and Lookup for child data pipeline)
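The page-range and offset arithmetic in the steps above can be checked outside the pipeline with a short Python sketch. It reproduces the same math as the `@range(...)` expression and then multiplies each page number by 1000 to build the query string. The function names `page_indices` and `relative_urls` are hypothetical, and the `?limit=...&offset=...` format is taken from the "href" values in the sample response.

```python
PAGE_SIZE = 1000  # maximum page size allowed by the API, per the thread


def page_indices(total_results: int, page_size: int = PAGE_SIZE) -> range:
    """Same arithmetic as the pipeline expression:
    range(0, ((total - total mod size) div size) + 1).
    With a start of 0, the second argument is both the count and the
    exclusive end, so Python's range gives the same page numbers.
    """
    full_pages = (total_results - total_results % page_size) // page_size
    return range(0, full_pages + 1)


def relative_urls(total_results: int, page_size: int = PAGE_SIZE) -> list:
    """One query string per ForEach item, with offset = page * page_size."""
    return [
        f"?limit={page_size}&offset={page * page_size}"
        for page in page_indices(total_results, page_size)
    ]


urls = relative_urls(60316)  # totalResults from the sample response
print(len(urls))
print(urls[0])
print(urls[-1])
```

For the sample value of 60316 this gives 61 pages, and the last offset is 60000, which matches the "last" link in the sample response. One thing to be aware of: when "totalResults" is an exact multiple of 1000, the same arithmetic still adds 1, so the final request would likely come back with no items.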

