
Ingest data into Blob Storage from downloadable URLs without having to download the files

I'm trying to ingest the Wiki dumps at https://dumps.wikimedia.org/enwiki/20201001/ into Azure Blob Storage using Python.

The files are around 200-300 MB each, but there are so many of them that the total size is more than 50 GB.

I don't want to fill up my laptop's storage, so I'd rather not download the files to the local drive and then upload them to Blob Storage.

Is there any way to stream the files from the URLs directly to Blob Storage?

You can create a Data Factory, which supports REST API as a source type and Blob Storage as a sink.

If you're using the azure-storage-blob package (version 12.5.0), you can use the start_copy_from_url method directly. Note that each call copies a single file, so you need to call the method once per file.

Here is the sample code:

from azure.storage.blob import BlobServiceClient

CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=xxx;AccountKey=xxx;EndpointSuffix=core.windows.net"

def run_sample():
    blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    source_blob = "http://www.gutenberg.org/files/59466/59466-0.txt"
    copied_blob = blob_service_client.get_blob_client("your_container_name", '59466-0.txt')
    
    # Note: start_copy_from_url returns as soon as the copy has been started, while the
    # copy itself may still be in progress. Check the copy status afterwards, as described
    # in the official sample mentioned below.
    copied_blob.start_copy_from_url(source_blob)

if __name__ == "__main__":
    run_sample()

For more details, please refer to the complete sample on GitHub.
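
The sample above copies a single file and does not wait for the copy to finish. Since the question involves many files, here is a minimal sketch of how the same call might be applied to a list of URLs, polling each blob's copy status until none is pending. The helper names blob_name_from_url and copy_urls_to_container, the poll_seconds parameter, and the choice to start all copies before polling are assumptions made for illustration; they are not part of azure-storage-blob. The sketch also assumes the source URLs are publicly readable.

```python
import posixpath
import time
from urllib.parse import urlparse


def blob_name_from_url(url):
    """Hypothetical helper: use the last path segment of the URL as the blob name."""
    name = posixpath.basename(urlparse(url).path)
    if not name:
        raise ValueError(f"cannot derive a blob name from {url!r}")
    return name


def copy_urls_to_container(container_client, urls, poll_seconds=5.0):
    """Hypothetical helper: start one server-side copy per URL, then poll.

    container_client is expected to behave like azure.storage.blob.ContainerClient.
    Returns a dict mapping each URL to its final copy status string.
    """
    pending = {}
    for url in urls:
        blob = container_client.get_blob_client(blob_name_from_url(url))
        # Returns once the copy is started; the data moves inside Azure,
        # so nothing is written to the local disk.
        blob.start_copy_from_url(url)
        pending[url] = blob

    results = {}
    while pending:
        for url, blob in list(pending.items()):
            status = blob.get_blob_properties().copy.status
            if status != "pending":  # e.g. "success", "failed" or "aborted"
                results[url] = status
                del pending[url]
        if pending:
            time.sleep(poll_seconds)
    return results


def main():
    # Imported here so the helpers above do not require the SDK at import time.
    from azure.storage.blob import BlobServiceClient

    connection_string = "DefaultEndpointsProtocol=https;AccountName=xxx;AccountKey=xxx;EndpointSuffix=core.windows.net"
    service = BlobServiceClient.from_connection_string(connection_string)
    container = service.get_container_client("your_container_name")
    urls = ["http://www.gutenberg.org/files/59466/59466-0.txt"]
    for url, status in copy_urls_to_container(container, urls).items():
        print(status, url)
```

Calling main() with a real connection string and container name would run the copies. For a large set of files it may be preferable to start the copies in smaller batches rather than all at once; that trade-off is left open here.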
