Ingest data into Blob Storage from downloadable URLs without having to download the files
I'm trying to ingest the Wiki dumps at https://dumps.wikimedia.org/enwiki/20201001/ into Azure Blob Storage using Python.
Each file is around 200-300 MB, but the point is that there are so many files that the total size exceeds 50 GB.
I don't want to jeopardize my local laptop's storage, so I don't want to download the files to the local drive and then upload them to Blob Storage.
Is there any option to stream the files from the URLs to Blob Storage directly?
You can create a Data Factory pipeline, which supports REST API as a source type and Blob Storage as a sink.
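For reference, the Copy activity in such a pipeline might be sketched roughly like this (a sketch only: the activity and dataset names are placeholders, and the exact source/sink types depend on the datasets and linked services you define in your factory):

```json
{
  "name": "CopyWikiDumpToBlob",
  "type": "Copy",
  "inputs": [ { "referenceName": "WikiDumpSourceDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "BlobSinkDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "RestSource" },
    "sink": { "type": "BlobSink" }
  }
}
```

The copy runs entirely inside Azure, so nothing touches the local machine.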
If you're using the package azure-storage-blob 12.5.0, you can directly use the start_copy_from_url method. Note that this method copies one file at a time, so you need to call it once per file.
Here is the sample code:
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=xxx;AccountKey=xxx;EndpointSuffix=core.windows.net"

def run_sample():
    blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    source_blob = "http://www.gutenberg.org/files/59466/59466-0.txt"
    copied_blob = blob_service_client.get_blob_client("your_container_name", '59466-0.txt')
    # Note: the method returns immediately while the copy runs server-side; you need to
    # check the copy status as per the official doc mentioned below.
    copied_blob.start_copy_from_url(source_blob)

if __name__ == "__main__":
    run_sample()
For more details, please refer to the complete sample on GitHub.