Copying and Extracting Zipped XML Files from an HTTP Linked Source to Azure Blob Storage using Azure Data Factory
I am trying to set up an Azure Data Factory copy data pipeline. The source is an open HTTP linked source (URL: https://clinicaltrials.gov/AllPublicXML.zip ). The source is a zipped folder containing many XML files. I want to unzip it and save the extracted XML files in Azure Blob Storage using Azure Data Factory. I was trying to follow the configuration described in "How to decompress a zip file in Azure Data Factory v2", but I am getting the following error:
ErrorCode=UserErrorSourceNotSeekable,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Your HttpServer source can't support random read which is requied by current copy activity setting, please create two copy activities to work around it: the first copy activity binary copy your HttpServer source to a staging file store(like Azure Blob, Azure Data Lake, File, etc.), second copy activity copy from the staged file store to your destination with current settings.,Source=Microsoft.DataTransfer.ClientLibrary,'
I am not exactly sure what is going wrong, so it would be really helpful if someone could guide me through the procedure.
I broke this up into two Copy data activities in order to separate the downloading of the zip file (which is quite large) from the unpacking. You could try to do both in one step, but I think you would run into timeout issues. With my approach you also keep a copy of the original zip file, which is useful for audit trail and debugging purposes.
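To illustrate why the split helps, here is a minimal local sketch in Python, not ADF itself. A zip archive keeps its file index at the end of the file, so a reader has to seek, which a forward-only HTTP response stream cannot do. That matches the "can't support random read" message in the error. The class and function names and the sample file names below are hypothetical, and an in-memory buffer stands in for the staging blob.

```python
import io
import shutil
import zipfile


class ForwardOnlyStream(io.RawIOBase):
    """Read-only, non-seekable stream; stands in for an HTTP response body."""

    def __init__(self, data: bytes):
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, buffer) -> int:
        return self._inner.readinto(buffer)


def make_sample_zip() -> bytes:
    """Build a tiny archive with two hypothetical XML files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("NCT0001.xml", "<study id='1'/>")
        archive.writestr("NCT0002.xml", "<study id='2'/>")
    return buf.getvalue()


def stage(source, staging) -> None:
    """Step 1: plain binary copy from the forward-only source to staging."""
    shutil.copyfileobj(source, staging)
    staging.seek(0)


def extract(staging, sink: dict) -> None:
    """Step 2: unzip from the seekable staged copy into the sink."""
    with zipfile.ZipFile(staging) as archive:
        for name in archive.namelist():
            sink[name] = archive.read(name)


if __name__ == "__main__":
    source = ForwardOnlyStream(make_sample_zip())
    staging = io.BytesIO()   # stands in for the staged zip blob
    sink = {}                # stands in for the destination container
    stage(source, staging)
    extract(staging, sink)
    print(sorted(sink))
```

The staged copy is what makes the second step possible: once the bytes sit in a store that supports random reads, the archive index can be located and each entry extracted.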
I try to document my ADF patterns in a boxes-and-lines format that shows the key details of each component. Here there are two Copy activities, plus the supporting linked services and datasets. Try following this and let me know how you get on:
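As a rough sketch of a two-activity layout like this, the Python below builds a pipeline definition as a dictionary and prints it as JSON. It is not the exact configuration from the answer: all dataset, activity, and pipeline names are hypothetical, the linked services and datasets are assumed to exist already, and the property names are written from memory of the ADF JSON format, so check them against the current documentation before use.

```python
import json

# Hypothetical dataset names (assumed to be defined separately in the factory).
HTTP_ZIP = "HttpZipBinary"            # Binary dataset on the HTTP linked service
STAGED_ZIP_RAW = "StagedZipRaw"       # Binary dataset on Blob Storage, no compression
STAGED_ZIP_UNZIP = "StagedZipUnzip"   # same blob, with compression type ZipDeflate
XML_FOLDER = "ExtractedXmlFolder"     # Binary dataset on Blob Storage, no compression


def dataset_ref(name: str) -> dict:
    """Reference a dataset by name from an activity."""
    return {"referenceName": name, "type": "DatasetReference"}


def build_pipeline() -> dict:
    # Activity 1: binary copy of the zip from HTTP to the staging blob, as is.
    download = {
        "name": "DownloadZip",
        "type": "Copy",
        "inputs": [dataset_ref(HTTP_ZIP)],
        "outputs": [dataset_ref(STAGED_ZIP_RAW)],
        "typeProperties": {
            "source": {
                "type": "BinarySource",
                "storeSettings": {"type": "HttpReadSettings", "requestMethod": "GET"},
            },
            "sink": {
                "type": "BinarySink",
                "storeSettings": {"type": "AzureBlobStorageWriteSettings"},
            },
        },
    }
    # Activity 2: read the staged zip through a ZipDeflate dataset and write
    # the extracted files to the destination folder.
    unzip = {
        "name": "UnzipXml",
        "type": "Copy",
        "dependsOn": [
            {"activity": "DownloadZip", "dependencyConditions": ["Succeeded"]}
        ],
        "inputs": [dataset_ref(STAGED_ZIP_UNZIP)],
        "outputs": [dataset_ref(XML_FOLDER)],
        "typeProperties": {
            "source": {
                "type": "BinarySource",
                "storeSettings": {"type": "AzureBlobStorageReadSettings"},
                "formatSettings": {
                    "type": "BinaryReadSettings",
                    "compressionProperties": {
                        "type": "ZipDeflateReadSettings",
                        "preserveZipFileNameAsFolder": False,
                    },
                },
            },
            "sink": {
                "type": "BinarySink",
                "storeSettings": {"type": "AzureBlobStorageWriteSettings"},
            },
        },
    }
    return {
        "name": "CopyAndUnzipXml",
        "properties": {"activities": [download, unzip]},
    }


if __name__ == "__main__":
    print(json.dumps(build_pipeline(), indent=2))
```

The staged zip is referenced by two datasets on purpose: the first activity writes it without any compression setting so the bytes land unchanged, and the second reads the same blob through a dataset that declares ZipDeflate so the copy activity decompresses it.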
NB: it took quite a long time for ADF to unpack the .xml files, as there are rather a lot of them. My results, as shown in Azure Storage Explorer: