How can one use the StorageStreamDownloader to stream download from a blob and stream upload to a different blob?
I believe I have a very simple requirement for which a solution has befuddled me. I am new to the azure-python-sdk and have had little success with its new blob streaming functionality.
I have used the Java SDK for several years now. Each CloudBlockBlob object has a BlobInputStream and a BlobOutputStream object. When a BlobInputStream is opened, one can invoke its many functions (most notably its read() function) to retrieve data in a true-streaming fashion. A BlobOutputStream, once retrieved, has a write(byte[] data) function where one can continuously write data as frequently as they want until the close() function is invoked. So, it was very easy for me to:

1. Get a CloudBlockBlob object, open its BlobInputStream and essentially get back an InputStream that was 'tied' to the CloudBlockBlob. It usually maintained 4MB of data - at least, that's what I understood.
2. Do some operations on the data I received.
3. Retrieve the CloudBlockBlob object that I am uploading to, get its BlobOutputStream, and write to it the data I did some operations on.

A good example of this is if I wanted to compress a file. I had a GzipStreamReader class that would accept a BlobInputStream and a BlobOutputStream. It would read data from the BlobInputStream and, whenever it had compressed some amount of data, write to the BlobOutputStream. It could call write() as many times as it wished; when it finished reading all the data, it would close both Input and Output streams, and all was good.
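For what it's worth, that compress-as-you-stream pattern can be sketched in Python with nothing but the standard library; the two io.BytesIO objects below merely stand in for BlobInputStream and BlobOutputStream, so no Azure calls are involved:

```python
import gzip
import io

# Stand-ins for BlobInputStream / BlobOutputStream; no Azure involved.
source = io.BytesIO(b"some data to compress " * 1000)
destination = io.BytesIO()

# Compress as we stream: read a bounded chunk, write its compressed form.
with gzip.GzipFile(fileobj=destination, mode="wb") as gz:
    while True:
        chunk = source.read(4 * 1024 * 1024)  # read at most 4MB at a time
        if not chunk:
            break
        gz.write(chunk)

# Round-trip check: decompressing recovers the original bytes.
assert gzip.decompress(destination.getvalue()) == b"some data to compress " * 1000
```

The point is only that the reader never needs more than one bounded chunk in memory at a time.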
Now, the Python SDK is a little different, and obviously for good reason; the io module works differently than Java's InputStream and OutputStream classes (which the Blob{Input/Output}Stream classes inherit from). I have been struggling to understand how streaming truly works in Azure's Python SDK. To start out, I am just trying to see how the StorageStreamDownloader class works. It seems like the StorageStreamDownloader is what holds the 'connection' to the BlockBlob object I am reading data from. If I want to put the data in a stream, I would make a new io.BytesIO() and pass that stream to the StorageStreamDownloader's readinto method.
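As far as I can tell, the call shape would be the one sketched below (the client construction is my assumption, left as comments because it needs a live account); the receiving end is ordinary io machinery, and io.BytesIO exposes the same readinto contract:

```python
import io

# Hypothetical client setup (assumed names, not executed here):
#   from azure.storage.blob import BlobClient
#   blob_client = BlobClient(account_url=..., container_name=...,
#                            blob_name=..., credential=...)
#   downloader = blob_client.download_blob()   # a StorageStreamDownloader
#   buffer = io.BytesIO()
#   downloader.readinto(buffer)                # all blob bytes land in buffer

# The readinto contract itself, shown against a local byte source:
source = io.BytesIO(b"bytes pretending to come from a blob")
buf = bytearray(5)
n = source.readinto(buf)   # fills buf with up to 5 bytes, returns the count
assert n == 5 and bytes(buf) == b"bytes"
```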
For uploads, I would call the BlobClient's upload_blob method. The upload method accepts a data parameter that is of type Union[Iterable[AnyStr], IO[AnyStr]].
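Since data is typed Union[Iterable[AnyStr], IO[AnyStr]], a generator that yields byte chunks should satisfy it; a sketch (the upload call itself is commented out because it needs a live client, and the client name is my assumption):

```python
def produce_chunks():
    # Stands in for any chunk-by-chunk producer (e.g. compressed pieces).
    for i in range(3):
        yield f"chunk-{i};".encode()

# Hypothetical upload (needs a live BlobClient):
#   blob_client.upload_blob(data=produce_chunks(), overwrite=True)

# Locally, the Iterable[bytes] contract is just iteration:
joined = b"".join(produce_chunks())
assert joined == b"chunk-0;chunk-1;chunk-2;"
```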
I don't want to go into too much detail about what I understand, because what I understand and what I have done have gotten me nowhere. I am suspicious that I am expecting something that only the Java SDK offers. But, overall, here are the problems I am having:
1. When I call download_blob, I receive a StorageStreamDownloader with all the data in the blob. Some investigation has shown that I can use the offset and length parameters to download only the amount of data I want. Perhaps I can call it once with download_blob(offset=0, length=4MB), process the data I get back, then call download_blob(offset=4MB, length=4MB) again, process the data, etc. This is unfavorable. The other thing I could do is utilize the max_chunk_get_size parameter for the BlobClient and turn on the validate_content flag (make it true) so that the StorageStreamDownloader only downloads 4MB at a time. But this all results in several problems: that's not really streaming from a stream object, and I'll still have to call download and readinto several times.
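To make the ranged-download idea concrete, the loop reduces to generating (offset, length) pairs; this small helper (my own naming, not part of the SDK) shows the arithmetic, with each pair feeding one download_blob(offset=..., length=...) call:

```python
def byte_ranges(total_size, chunk=4 * 1024 * 1024):
    """Yield (offset, length) pairs covering a blob of total_size bytes."""
    offset = 0
    while offset < total_size:
        length = min(chunk, total_size - offset)
        yield offset, length
        offset += length

# Each pair would drive one ranged call, e.g.:
#   data = blob_client.download_blob(offset=off, length=ln).readall()
assert list(byte_ranges(10, chunk=4)) == [(0, 4), (4, 4), (8, 2)]
```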
2. And fine, I would do that, if it weren't for the second problem: BlockBlobs. The docs for the upload_blob function say that I can provide a param overwrite that does:
keyword bool overwrite: Whether the blob to be uploaded should overwrite the current data. If True, upload_blob will overwrite the existing data. If set to False, the operation will fail with ResourceExistsError. The exception to the above is with Append blob types: if set to False and the data already exists, an error will not be raised and the data will be appended to the existing blob. If set overwrite=True, then the existing append blob will be deleted, and a new one created. Defaults to False.
BlockBlobs, once written to, cannot be written to again. So AFAIK, you can't 'stream' an upload. Okay. I am certain I am missing something important.
I am also somewhat ignorant when it comes to the io module in Python. Though I have developed in Python for a long time, I never really had to deal with that module too closely. I am sure I am missing something, because this functionality is very basic and exists in all the other azure SDKs I know about.
Everything I said above can honestly be ignored, and only this portion read; I am just trying to show I've done some due diligence. I want to know how to stream data from a blob, process the data I get in a stream, then upload that data. I cannot be receiving all the data in a blob at once. Blobs are likely to be over 1GB and all that pretty stuff. I would honestly love some example code that shows:

1. Streaming a download from a blob.
2. Doing some operations on the data in the stream.
3. Streaming an upload of that data to a different blob.

This should work for blobs of all sizes; whether it's 1MB or 10MB or 10GB should not matter. Step 2 can be anything really; it can also be nothing. Just as long as data is being downloaded, inserted into a stream, then uploaded, that would be great. Of course, the other extremely important constraint is that the data per 'download' shouldn't be an amount more than 10MB.
I hope this makes sense. I just want to stream data. This shouldn't be that hard.
Edit:

Some people may want to close this and claim the question is a duplicate. I have forgotten to include something very important: I am currently using the newest, most up-to-date azure-sdk version. My azure-storage-blob package's version is 12.5.0. There have been other questions similar to what I have asked for severely outdated versions. I have searched for other answers, but haven't found any for 12+ versions.
If you want to download an azure blob in chunks, process every chunk of data and upload every chunk to another azure blob, please refer to the following code:
import uuid
from azure.storage.blob import BlobClient, BlobBlock

key = '<account key>'
source_blob_client = BlobClient(account_url='https://andyprivate.blob.core.windows.net',
                                container_name='',  # fill in the source container name
                                blob_name='',       # fill in the source blob name
                                credential=key,
                                max_chunk_get_size=4 * 1024 * 1024,   # the size of each chunk is 4MB
                                max_single_get_size=4 * 1024 * 1024)
des_blob_client = BlobClient(account_url='https://<account name>.blob.core.windows.net',
                             container_name='',  # fill in the destination container name
                             blob_name='',       # fill in the destination blob name
                             credential=key)

stream = source_blob_client.download_blob()
block_list = []
# read the data in chunks
for chunk in stream.chunks():
    # process your data here
    processed = chunk  # replace with <the data after you process>
    # use the Put Block REST API to upload the chunk to azure storage
    blk_id = str(uuid.uuid4())
    des_blob_client.stage_block(block_id=blk_id, data=processed)
    block_list.append(BlobBlock(block_id=blk_id))
# use the Put Block List REST API to combine the staged blocks into one blob
des_blob_client.commit_block_list(block_list)
Besides, if you just want to copy one blob from one storage place to another storage place, you can directly use the method start_copy_from_url.