Ingest data from a downloadable URL into Blob Storage without downloading the file

Question

I'm trying to ingest the Wiki dumps at https://dumps.wikimedia.org/enwiki/20201001/ into Azure Blob Storage using Python.

Each file is around 200-300 MB, but there are so many files that the total size is more than 50 GB.

I don't want to fill up my laptop's storage, so I'd rather not download the files to the local drive and then upload them to Blob Storage.

Is there any way to stream the files from the URLs directly to Blob Storage?

Answer 1

You can create a Data Factory, which supports a REST API as the source type and Blob Storage as the sink.

Answer 2

If you're using the azure-storage-blob package (version 12.5.0), you can use the start_copy_from_url method directly. Note that this method copies one file per call, so you need to call it once for each file.

Here is the sample code:

from azure.storage.blob import BlobServiceClient

CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=xxx;AccountKey=xxx;EndpointSuffix=core.windows.net"

def run_sample():
    blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    source_blob = "http://www.gutenberg.org/files/59466/59466-0.txt"
    copied_blob = blob_service_client.get_blob_client("your_container_name", '59466-0.txt')
    
    # Note: start_copy_from_url returns immediately, while the copy may still be
    # in progress. Check the copy status as shown in the official sample referenced below.
    copied_blob.start_copy_from_url(source_blob)

if __name__ == "__main__":
    run_sample()

For more details, please refer to the complete sample on GitHub.
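Since the question involves many files, here is a minimal sketch of how the single-file sample above might be extended: it starts one copy per URL and then polls each blob's copy status. The helper names (blob_name_from_url, start_copies, wait_for_copy), the polling interval, and the timeout are my own assumptions for illustration and are not part of the SDK; only get_container_client, get_blob_client, start_copy_from_url and get_blob_properties are azure-storage-blob calls. The URL list is left for you to fill in.

```python
import time
from urllib.parse import unquote, urlparse


def blob_name_from_url(url):
    """Use the last path segment of the URL as the blob name (an assumed convention)."""
    name = unquote(urlparse(url).path.rsplit("/", 1)[-1])
    if not name:
        raise ValueError(f"cannot derive a blob name from {url!r}")
    return name


def start_copies(container_client, urls):
    """Start one server-side copy per URL and return {blob_name: blob_client}."""
    started = {}
    for url in urls:
        name = blob_name_from_url(url)
        blob_client = container_client.get_blob_client(name)
        # Returns right away; the copy itself may still be pending.
        blob_client.start_copy_from_url(url)
        started[name] = blob_client
    return started


def wait_for_copy(blob_client, poll_seconds=5.0, timeout_seconds=3600.0):
    """Poll until the copy status is no longer 'pending', then return that status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = blob_client.get_blob_properties().copy.status
        if status != "pending":
            return status  # typically 'success', 'failed' or 'aborted'
        if time.monotonic() >= deadline:
            raise TimeoutError("copy is still pending after the timeout")
        time.sleep(poll_seconds)


def run_sample(connection_string, container_name, urls):
    # Imported here so the helpers above can be read and tested without the SDK.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(connection_string)
    container = service.get_container_client(container_name)
    for name, blob_client in start_copies(container, urls).items():
        print(name, wait_for_copy(blob_client))
```

To try it, call run_sample with your connection string, "your_container_name", and a list of source URLs such as the Gutenberg URL from the sample above. All copies are started before any polling begins, so the waits overlap; if you would rather handle one file at a time, call wait_for_copy right after each start_copy_from_url instead.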


Answer 1
0 2020-10-18 15:42:51

Answer 2
0 Accepted 2020-10-19 08:04:13
