I believe I have a very simple requirement, but its solution has befuddled me. I am new to the azure-python-sdk and have had little success with its new blob streaming functionality.
I have used the Java SDK for several years now. Each CloudBlockBlob object has a BlobInputStream and a BlobOutputStream. When a BlobInputStream is opened, one can invoke its many functions (most notably its read() function) to retrieve data in a true-streaming fashion. A BlobOutputStream, once retrieved, has a write(byte[] data) function where one can continuously write data as frequently as one wants until the close() function is invoked. So, it was very easy for me to:

1. Get a CloudBlockBlob object, open its BlobInputStream, and essentially get back an InputStream that was 'tied' to the CloudBlockBlob. It usually maintained 4MB of data - at least, that's what I understood. When some amount of data is read from its buffer, the same amount of new data is introduced, so it always holds approximately 4MB of fresh data (until all data is retrieved).

2. Get the CloudBlockBlob object that I am uploading to, get its BlobOutputStream, and write to it the data I did some operations on.

A good example of this is if I wanted to compress a file. I had a GzipStreamReader class that would accept a BlobInputStream and a BlobOutputStream. It would read data from the BlobInputStream and, whenever it had compressed some amount of data, write to the BlobOutputStream. It could call write() as many times as it wished; when it finished reading all the data, it would close both the input and output streams, and all was good.
Now, the Python SDK is a little different, and obviously for good reason; the io module works differently than Java's InputStream and OutputStream classes (which the Blob{Input/Output}Stream classes inherit from). I have been struggling to understand how streaming truly works in Azure's Python SDK. To start out, I am just trying to see how the StorageStreamDownloader class works. It seems like the StorageStreamDownloader is what holds the 'connection' to the BlockBlob object I am reading data from. If I want to put the data in a stream, I would make a new io.BytesIO() and pass that stream to the StorageStreamDownloader's readinto method.
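In other words, I imagine something like this (a minimal sketch; the account, container, and blob names are placeholders):

import io
from azure.storage.blob import BlobClient

blob_client = BlobClient(account_url='https://<account name>.blob.core.windows.net',
                         container_name='<container>',
                         blob_name='<blob>',
                         credential='<account key>')

# download_blob() returns a StorageStreamDownloader
downloader = blob_client.download_blob()

# copy the downloaded bytes into an in-memory stream
buffer = io.BytesIO()
downloader.readinto(buffer)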
For uploads, I would call the BlobClient's upload_blob method. The upload_blob method accepts a data parameter that is of type Union[Iterable[AnyStr], IO[AnyStr]].
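So presumably I could hand upload_blob an in-memory stream, along these lines (again a sketch with placeholder names):

import io
from azure.storage.blob import BlobClient

dest_client = BlobClient(account_url='https://<account name>.blob.core.windows.net',
                         container_name='<container>',
                         blob_name='<destination blob>',
                         credential='<account key>')

# data can be bytes, a file-like object, or an iterable of chunks
buffer = io.BytesIO(b'...data I did some operations on...')
dest_client.upload_blob(buffer, overwrite=True)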
I don't want to go into too much detail about what I understand, because what I understand and what I have done have gotten me nowhere. I am suspicious that I am expecting something that only the Java SDK offers. But, overall, here are the problems I am having:

1. Downloading: calling download_blob() returns a StorageStreamDownloader with all the data in the blob. Some investigation has shown that I can use the offset and length parameters to download only the amount of data I want. Perhaps I can call it once with download_blob(offset=0, length=4MB), process the data I get back, then call download_blob(offset=4MB, length=4MB), process the data, etc. This is unfavorable. The other thing I could do is utilize the max_chunk_get_size parameter for the BlobClient and turn on the validate_content flag (make it true) so that the StorageStreamDownloader only downloads 4MB at a time. But this all results in several problems: that's not really streaming from a stream object, and I'll still have to call download and readinto several times (I sketch this loop below). And fine, I would do that, if it weren't for the second problem.

2. Uploading to BlockBlobs: the docs for the upload_blob function say that I can provide an overwrite param that does: "keyword bool overwrite: Whether the blob to be uploaded should overwrite the current data. If True, upload_blob will overwrite the existing data. If set to False, the operation will fail with ResourceExistsError. The exception to the above is with Append blob types: if set to False and the data already exists, an error will not be raised and the data will be appended to the existing blob. If set overwrite=True, then the existing append blob will be deleted, and a new one created. Defaults to False." BlockBlobs, once written to, cannot be written to again. So AFAIK, you can't 'stream' an upload. If I can't have a stream object that is directly tied to the blob, or that holds all the data, then the upload() function will terminate as soon as it finishes, right?

Okay. I am certain I am missing something important. I am also somewhat ignorant when it comes to the io module in Python. Though I have developed in Python for a long time, I never really had to deal with that module too closely. I am sure I am missing something, because this functionality is very basic and exists in all the other Azure SDKs I know about.
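For clarity, the offset/length loop I mean above would look roughly like this (a hypothetical sketch; the names and the 4MB size are placeholders):

from azure.storage.blob import BlobClient

CHUNK = 4 * 1024 * 1024  # 4MB per download call

blob_client = BlobClient(account_url='https://<account name>.blob.core.windows.net',
                         container_name='<container>',
                         blob_name='<blob>',
                         credential='<account key>')

size = blob_client.get_blob_properties().size
for offset in range(0, size, CHUNK):
    length = min(CHUNK, size - offset)
    # each iteration issues a separate download call - not true streaming
    chunk = blob_client.download_blob(offset=offset, length=length).readall()
    # ... process chunk ...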
Everything I said above can honestly be ignored, and only this portion read; I am just trying to show I've done some due diligence. I want to know how to stream data from a blob, process the data I get in a stream, then upload that data. I cannot be receiving all the data in a blob at once. Blobs are likely to be over 1GB and all that pretty stuff. I would honestly love some example code that shows:

1. Download some amount of data from a blob into a stream
2. Do some operation on the data in the stream
3. Upload the data from the stream to another blob

This should work for blobs of all sizes; whether it's 1MB or 10MB or 10GB should not matter. Step 2 can be anything really; it can also be nothing. Just as long as data is being downloaded, inserted into a stream, then uploaded, that would be great. Of course, the other extremely important constraint is that the data per 'download' shouldn't be more than 10MB.
I hope this makes sense. I just want to stream data. This shouldn't be that hard.
Edit:
Some people may want to close this and claim the question is a duplicate. I have forgotten to include something very important: I am currently using the newest, most up-to-date azure-sdk version. My azure-storage-blob package's version is 12.5.0. There have been other questions similar to mine, but for severely outdated versions. I have searched for other answers, but haven't found any for the 12+ versions.
If you want to download an Azure blob in chunks, process each chunk of data, and upload each chunk back to an Azure blob, please refer to the following code:
import uuid
from azure.storage.blob import BlobClient, BlobBlock

key = '<account key>'

source_blob_client = BlobClient(account_url='https://andyprivate.blob.core.windows.net',
                                container_name='',
                                blob_name='',
                                credential=key,
                                max_chunk_get_size=4*1024*1024,  # the size of each chunk is 4MB
                                max_single_get_size=4*1024*1024)

des_blob_client = BlobClient(account_url='https://<account name>.blob.core.windows.net',
                             container_name='',
                             blob_name='',
                             credential=key)

stream = source_blob_client.download_blob()
block_list = []

# read the data in chunks
for chunk in stream.chunks():
    # process your data here

    # use the Put Block REST API to upload each chunk to Azure Storage
    blk_id = str(uuid.uuid4())
    des_blob_client.stage_block(block_id=blk_id, data=chunk)  # pass your processed data instead of the raw chunk
    block_list.append(BlobBlock(block_id=blk_id))

# use the Put Block List REST API to assemble the staged blocks into one blob
des_blob_client.commit_block_list(block_list)
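If the processing step were the gzip compression from the question, the same loop could compress while it streams. A minimal sketch, assuming the source_blob_client and des_blob_client configured above (zlib.compressobj with wbits=16 + zlib.MAX_WBITS produces gzip-framed output):

import uuid
import zlib
from azure.storage.blob import BlobBlock

# assumes source_blob_client and des_blob_client are configured as above
compressor = zlib.compressobj(wbits=16 + zlib.MAX_WBITS)  # gzip-framed output
block_list = []

for chunk in source_blob_client.download_blob().chunks():
    out = compressor.compress(chunk)
    if out:  # the compressor may buffer input, so only stage non-empty output
        blk_id = str(uuid.uuid4())
        des_blob_client.stage_block(block_id=blk_id, data=out)
        block_list.append(BlobBlock(block_id=blk_id))

tail = compressor.flush()  # flush whatever is still buffered
if tail:
    blk_id = str(uuid.uuid4())
    des_blob_client.stage_block(block_id=blk_id, data=tail)
    block_list.append(BlobBlock(block_id=blk_id))

des_blob_client.commit_block_list(block_list)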
Besides, if you just want to copy one blob from one storage location to another, you can directly use the method start_copy_from_url.
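For example, a minimal sketch reusing the two clients above (the copy runs server-side; if the source blob is not publicly readable, a SAS token must be appended to the source URL):

# start a server-side copy from the source blob's URL
copy_props = des_blob_client.start_copy_from_url(source_blob_client.url)
print(copy_props['copy_status'])  # 'success' for small blobs, 'pending' for large ones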