
How to read a big Azure blob storage file block by block

I want to read a huge Azure blob storage file and stream its content to Event-Hub. I found this example,

from azure.storage.blob import BlockBlobService

bb = BlockBlobService(account_name='', account_key='')
container_name = ""
blob_name_to_download = "test.txt"
file_path ="/home/Adam/Downloaded_test.txt"

bb.get_blob_to_path(container_name, blob_name_to_download, file_path, open_mode='wb', 
 snapshot=None, start_range=None, end_range=None, validate_content=False, 
 progress_callback=None, max_connections=2, lease_id=None, 
 if_modified_since=None, if_unmodified_since=None, 
 if_match=None, if_none_match=None, timeout=None)

But this downloads the whole blob to a file in a single call; it doesn't let me read the blob in chunks inside a loop, which is what I want to do. How can I modify this code for my case?

If you notice, there are two parameters in the get_blob_to_path method - start_range and end_range. These two parameters allow you to read a blob's data in chunks.

What you need to do is get the blob's properties first to find its content length, and then repeatedly call a get_blob_xxx method to fetch the data in chunks. I used the get_blob_to_text method, but you can find the other variants in the SDK reference.
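Before the blob-specific pseudo code below, the chunking arithmetic on its own can be sketched locally. This is a minimal, self-contained example: `data` is an in-memory stand-in for the blob's content, and `fetch_range` is a hypothetical helper playing the role of a get_blob_xxx call with inclusive start_range/end_range parameters.

```python
# Sketch of ranged chunk reads. `data` stands in for the blob's content
# and `fetch_range` for a get_blob_xxx call; both are placeholders.

def fetch_range(data: bytes, start_range: int, end_range: int) -> bytes:
    # start_range and end_range are inclusive, mirroring the SDK parameters
    return data[start_range:end_range + 1]

data = b"x" * 2500      # pretend blob of 2500 bytes
chunk_size = 1024       # 1 KB chunks, just for the example
blob_size = len(data)

chunks = []
start = 0
while start < blob_size:
    end_range = start + chunk_size - 1
    chunks.append(fetch_range(data, start, end_range))
    start = end_range + 1

print([len(c) for c in chunks])  # → [1024, 1024, 452]
```

The last chunk is simply shorter than chunk_size; the ranged-read semantics return whatever bytes remain.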

Here's the pseudo code I came up with. HTH.

bb = BlockBlobService(account_name='', account_key='')
container_name = ""
blob_name_to_download = "test.txt"
file_path ="/home/Adam/Downloaded_test.txt"

#First get blob properties. We would want to find out blob's content length
blob = bb.get_blob_properties(container_name, blob_name_to_download)

#extract content length from blob's properties
blob_size = blob.properties.content_length

#now let's say we want to fetch 1MB chunk at a time so we loop and fetch 1MB content at a time.
start = 0
end = blob_size
chunk_size = 1 * 1024 * 1024 #1MB
do
    start_range = start
    end_range = start + chunk_size - 1
    blob_chunk_content = bb.get_blob_to_text(container_name, blob_name_to_download, 
        encoding='utf-8', snapshot=None, start_range=start_range, end_range=end_range, 
        validate_content=False, progress_callback=None, max_connections=2, 
        lease_id=None, if_modified_since=None, if_unmodified_since=None, 
        if_match=None, if_none_match=None, timeout=None)
    #blob_chunk_content will have 1 MB data. Do whatever you like with it.
    start = end_range + 1
while (start < end)

Here is the Python version of Gaurav's pseudocode. Note that I had to install the azure.storage.blob package with pip install azure-storage-blob==2.1.0.

from azure.storage.blob import BlockBlobService

bb = BlockBlobService(account_name='<storage_account_name>', account_key='<account_key>')
container_name = "<container_name>"
blob_name = "<dir>/<file>"

#First get blob properties. We would want to find out blob's content length
blob = bb.get_blob_properties(container_name=container_name, blob_name=blob_name)

#extract content length from blob's properties
blob_size = blob.properties.content_length

#now let's say we want to fetch 1MB chunk at a time,
# so we loop and fetch 1MB content at a time.
start = 0
end = blob_size
chunk_size = 1 * 1024 * 1024 # 1MB

while start < end:
  start_range = start
  end_range = min(start + chunk_size - 1, blob_size - 1) # clamp the last chunk to the blob's end
  blob_chunk_content = bb.get_blob_to_text(container_name, blob_name, 
    encoding='utf-8', snapshot=None, start_range=start_range, end_range=end_range, 
    validate_content=False, progress_callback=None, max_connections=2, 
    lease_id=None, if_modified_since=None, if_unmodified_since=None, 
    if_match=None, if_none_match=None, timeout=None)
  print(blob_chunk_content.content)
  start = end_range + 1
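Note that BlockBlobService comes from the legacy 2.x SDK. In the current azure-storage-blob v12 package, the same ranged read is done with BlobClient.download_blob, which takes offset and length parameters (length is a byte count, not an end offset). Here is a hedged sketch under that assumption; the connection string, container, and blob names are placeholders, and iter_ranges is a small helper I'm introducing for the range arithmetic:

```python
from typing import Iterator, Tuple

def iter_ranges(blob_size: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) byte ranges covering a blob of blob_size bytes."""
    start = 0
    while start < blob_size:
        yield start, min(start + chunk_size - 1, blob_size - 1)
        start += chunk_size

if __name__ == "__main__":
    # Requires `pip install azure-storage-blob` (v12); all names below
    # are placeholders to be filled in.
    from azure.storage.blob import BlobClient

    blob = BlobClient.from_connection_string(
        "<connection_string>",
        container_name="<container_name>",
        blob_name="<blob_name>")

    blob_size = blob.get_blob_properties().size
    for start, end in iter_ranges(blob_size, 1 * 1024 * 1024):
        # ranged download: length is the number of bytes in this chunk
        chunk = blob.download_blob(offset=start, length=end - start + 1).readall()
        # ... do whatever you like with `chunk`, e.g. send it to Event Hubs
```

The v12 downloader also exposes a chunks() iterator (blob.download_blob().chunks()) that streams the blob in SDK-managed pieces, if you don't need to control the chunk boundaries yourself.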
