
How can one use the StorageStreamDownloader to stream download from a blob and stream upload to a different blob?

I believe I have a very simple requirement for which a solution has befuddled me. I am new to the azure-python-sdk and have had little success with its new blob streaming functionality.

Some context

I have used the Java SDK for several years now. Each CloudBlockBlob object has a BlobInputStream and a BlobOutputStream object. When a BlobInputStream is opened, one can invoke its many functions (most notably its read() function) to retrieve data in a true-streaming fashion. A BlobOutputStream, once retrieved, has a write(byte[] data) function where one can continuously write data as frequently as they want until the close() function is invoked. So, it was very easy for me to:

  1. Get a CloudBlockBlob object, open its BlobInputStream and essentially get back an InputStream that was 'tied' to the CloudBlockBlob. It usually maintained 4MB of data - at least, that's what I understood. When some amount of data is read from its buffer, a new (same amount) of data is introduced, so it always has approximately 4MB of new data (until all data is retrieved).
  2. Perform some operations on that data.
  3. Retrieve the CloudBlockBlob object that I am uploading to, get its BlobOutputStream, and write to it the data I did some operations on.

A good example of this is if I wanted to compress a file. I had a GzipStreamReader class that would accept a BlobInputStream and a BlobOutputStream. It would read data from the BlobInputStream and, whenever it had compressed some amount of data, write to the BlobOutputStream. It could call write() as many times as it wished; when it finished reading all the data, it would close both Input and Output streams, and all was good.

Now for Python

Now, the Python SDK is a little different, and obviously for good reason; the io module works differently than Java's InputStream and OutputStream classes (which the Blob{Input/Output}Stream classes inherit from). I have been struggling to understand how streaming truly works in Azure's Python SDK. To start out, I am just trying to see how the StorageStreamDownloader class works. It seems like the StorageStreamDownloader is what holds the 'connection' to the BlockBlob object I am reading data from. If I want to put the data in a stream, I would make a new io.BytesIO() and pass that stream to the StorageStreamDownloader's readinto method.
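For reference, here is roughly the pattern I mean, as far as I understand the API so far (a minimal sketch; the account, container, blob and key values are placeholders):

import io
from azure.storage.blob import BlobClient

# placeholders - substitute your own account URL, container, blob and credential
blob_client = BlobClient(account_url='https://<account name>.blob.core.windows.net',
                         container_name='<container name>',
                         blob_name='<blob name>',
                         credential='<account key>')

downloader = blob_client.download_blob()   # returns a StorageStreamDownloader
buffer = io.BytesIO()
downloader.readinto(buffer)                # copy the downloaded bytes into the buffer
buffer.seek(0)                             # rewind before reading the data back out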

For uploads, I would call the BlobClient's upload method. The upload method accepts a data parameter that is of type Union[Iterable[AnyStr], IO[AnyStr]].
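And the upload side, as far as I can tell, would look something like this (again just a sketch with placeholder names; upload_blob reads the stream from its current position to EOF and uploads it as one block blob):

import io
from azure.storage.blob import BlobClient

dest_client = BlobClient(account_url='https://<account name>.blob.core.windows.net',
                         container_name='<container name>',
                         blob_name='<destination blob>',
                         credential='<account key>')

data = io.BytesIO(b'the processed bytes')   # any file-like object works here
dest_client.upload_blob(data, overwrite=True)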

I don't want to go into too much detail about what I understand, because what I understand and what I have done have gotten me nowhere. I am suspicious that I am expecting something that only the Java SDK offers. But, overall, here are the problems I am having:

  1. When I call download_blob, I get back a StorageStreamDownloader with all the data in the blob. Some investigation has shown that I can use the offset and length to download the amount of data I want. Perhaps I can call it once with download_blob(offset=0, length=4MB), process the data I get back, then call download_blob(offset=4MB, length=4MB) again, process that data, and so on (a rough sketch of this loop appears below, after this list). This is unfavorable. The other thing I could do is utilize the max_chunk_get_size parameter for the BlobClient and turn on the validate_content flag (make it true) so that the StorageStreamDownloader only downloads 4MB at a time. But this all results in several problems: that's not really streaming from a stream object. I'll still have to call download and readinto several times. And fine, I would do that, if it weren't for the second problem:
  2. How the heck do I stream an upload? The upload can take a stream. But if the stream doesn't auto-update itself, then I can only upload once, because all the blobs I deal with must be BlockBlobs. The docs for the upload_blob function say that I can provide an overwrite parameter that does:

keyword bool overwrite: Whether the blob to be uploaded should overwrite the current data. If True, upload_blob will overwrite the existing data. If set to False, the operation will fail with ResourceExistsError. The exception to the above is with Append blob types: if set to False and the data already exists, an error will not be raised and the data will be appended to the existing blob. If set overwrite=True, then the existing append blob will be deleted, and a new one created. Defaults to False.

  1. And this makes sense, because BlockBlobs, once written to, cannot be written to again. So AFAIK, you can't 'stream' an upload. If I can't have a stream object that is directly tied to the blob, or that holds all the data, then the upload() function will terminate as soon as it finishes, right?
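To be concrete, the ranged-download workaround from point 1 above would look roughly like this (a sketch; blob_client is assumed to be an already-constructed BlobClient, and 4MB is just the chunk size I mentioned):

CHUNK = 4 * 1024 * 1024                                    # 4MB per request
total = blob_client.get_blob_properties().size             # total blob size in bytes

offset = 0
while offset < total:
    length = min(CHUNK, total - offset)
    data = blob_client.download_blob(offset=offset, length=length).readall()
    # ... process `data` here ...
    offset += length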

Okay. I am certain I am missing something important. I am also somewhat ignorant when it comes to the io module in Python. Though I have developed in Python for a long time, I never really had to deal with that module too closely. I am sure I am missing something, because this functionality is very basic and exists in all the other Azure SDKs I know about.

To recap

Everything I said above can honestly be ignored, and only this portion read; I am just trying to show I've done some due diligence. I want to know how to stream data from a blob, process the data I get in a stream, then upload that data. I cannot be receiving all the data in a blob at once. Blobs are likely to be over 1GB and all that pretty stuff. I would honestly love some example code that shows:

  1. Retrieving some data from a blob (the data received in one call should not be more than 10MB) in a stream.
  2. Compressing the data in that stream.
  3. Uploading the data to a blob.

This should work for blobs of all sizes; whether it's 1MB or 10MB or 10GB should not matter. Step 2 can be anything really; it can also be nothing. Just as long as data is being downloaded, inserted into a stream, then uploaded, that would be great. Of course, the other extremely important constraint is that the data per 'download' shouldn't be more than 10MB.

I hope this makes sense. I just want to stream data. This shouldn't be that hard.

Edit:

Some people may want to close this and claim the question is a duplicate. I have forgotten to include something very important: I am currently using the newest, most up-to-date azure-sdk version. My azure-storage-blob package's version is 12.5.0. There have been other questions similar to what I have asked, but for severely outdated versions. I have searched for other answers, but haven't found any for 12+ versions.

If you want to download an Azure blob in chunks, process each chunk's data and upload each chunk to an Azure blob, please refer to the following code

from azure.storage.blob import BlobClient, BlobBlock
import uuid

key = '<account key>'

source_blob_client = BlobClient(account_url='https://<account name>.blob.core.windows.net',
                                container_name='<container name>',
                                blob_name='<blob name>',
                                credential=key,
                                max_chunk_get_size=4*1024*1024, # each downloaded chunk is 4MB
                                max_single_get_size=4*1024*1024)

des_blob_client = BlobClient(account_url='https://<account name>.blob.core.windows.net',
                             container_name='<container name>',
                             blob_name='<blob name>',
                             credential=key)
stream = source_blob_client.download_blob()
block_list = []

# read the source blob chunk by chunk
for chunk in stream.chunks():
    # process your data here; as a placeholder, the chunk is passed through unchanged
    processed_data = chunk

    # use the Put Block REST API to upload this chunk to azure storage
    blk_id = str(uuid.uuid4())
    des_blob_client.stage_block(block_id=blk_id, data=processed_data)
    block_list.append(BlobBlock(block_id=blk_id))

# use the Put Block List REST API to commit all staged blocks as one blob
des_blob_client.commit_block_list(block_list)
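For example, if you want to gzip-compress the data while copying it (as described in the question), the "process your data" step could look roughly like the sketch below. This is just one way to do it: it uses zlib.compressobj with wbits set for the gzip container, and the final flush() is staged as its own block so no compressed data is lost.

import uuid
import zlib
from azure.storage.blob import BlobBlock

# sketch: gzip-compress each downloaded chunk and stage it as a block
# source_blob_client and des_blob_client are the clients created above
compressor = zlib.compressobj(wbits=zlib.MAX_WBITS | 16)   # 16 + MAX_WBITS => gzip format
block_list = []

stream = source_blob_client.download_blob()
for chunk in stream.chunks():
    compressed = compressor.compress(chunk)
    if compressed:                                         # compressobj may buffer small inputs
        blk_id = str(uuid.uuid4())
        des_blob_client.stage_block(block_id=blk_id, data=compressed)
        block_list.append(BlobBlock(block_id=blk_id))

tail = compressor.flush()                                  # emit the remaining gzip data
if tail:
    blk_id = str(uuid.uuid4())
    des_blob_client.stage_block(block_id=blk_id, data=tail)
    block_list.append(BlobBlock(block_id=blk_id))

des_blob_client.commit_block_list(block_list)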

Besides, if you just want to copy one blob from one storage location to another, you can directly use the method start_copy_from_url.
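A minimal sketch of that (the source URL must be reachable by the storage service, so append a SAS token if the source blob is not public; the names below are placeholders):

from azure.storage.blob import BlobClient

des_blob_client = BlobClient(account_url='https://<account name>.blob.core.windows.net',
                             container_name='<container name>',
                             blob_name='<blob name>',
                             credential='<account key>')

# server-side copy: no data flows through the client
source_url = 'https://<source account>.blob.core.windows.net/<container>/<blob>?<sas token>'
copy_props = des_blob_client.start_copy_from_url(source_url)
print(copy_props['copy_status'])                           # 'success' or 'pending'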
