
How to read a big CSV file from Azure Blob Storage

I will get a HUGE CSV file as a blob in Azure, and I need to parse it line by line in an Azure Function.

I am reading each of the blobs in my container and getting the content as a string, but I think that loads everything into memory, and then I split it by new lines. Is there a smarter way to do this?

container_name = "test"
block_blob_service = BlockBlobService(account_name=container_name, account_key="mykey")
a = block_blob_service.get_container_properties(container_name)
generator = block_blob_service.list_blobs(container_name)

for b in generator:
    r = block_blob_service.get_blob_to_text(container_name, b.name)
    for i in r.content.split("\n"):
        print(i)

I am not sure how huge your huge is, but for very large files (> 200 MB or so) I would use a streaming approach. The call get_blob_to_text downloads the entire file in one go and places it all in memory. Using get_blob_to_stream allows you to read line by line and process each one individually, with only the current line and your working set in memory. This is very fast and very memory efficient. We use a similar approach to split 1 GB files into smaller files; 1 GB takes a couple of minutes to process.

Keep in mind that depending on your Function App service plan the maximum execution time is 5 minutes by default (you can increase this to 10 minutes in host.json). Also, on the Consumption plan you are limited to 1.5 GB of memory per Function App (not per function - it is shared by all functions in the app). So be aware of these limits.
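For reference, that setting is functionTimeout at the root of host.json; something along these lines should raise the timeout to the 10-minute maximum allowed on the Consumption plan (check the docs for your runtime version):

{
    "functionTimeout": "00:10:00"
}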

From the docs:

get_blob_to_stream(container_name, blob_name, stream, snapshot=None, start_range=None, end_range=None, validate_content=False, progress_callback=None, max_connections=2, lease_id=None, if_modified_since=None, if_unmodified_since=None, if_match=None, if_none_match=None, timeout=None)
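For example, a rough sketch of that pattern (untested, using the same legacy BlockBlobService SDK as in the question) could download the blob in fixed-size ranges and yield complete lines, carrying any partial line over to the next chunk so that only one chunk is ever held in memory. The helper name iter_blob_lines, the chunk size and the account/container/blob names in the usage comment are placeholders of mine, not anything from the SDK:

import io
from azure.storage.blob import BlockBlobService

def iter_blob_lines(service, container_name, blob_name, chunk_size=1024 * 1024):
    # hypothetical helper, not part of the SDK: stream a blob and yield it line by line
    blob_size = service.get_blob_properties(container_name, blob_name).properties.content_length
    leftover = b""
    start = 0
    while start < blob_size:
        end = min(start + chunk_size, blob_size) - 1
        chunk_stream = io.BytesIO()  # a fresh buffer per chunk keeps memory bounded
        service.get_blob_to_stream(container_name, blob_name, stream=chunk_stream, start_range=start, end_range=end)
        data = leftover + chunk_stream.getvalue()
        lines = data.split(b"\n")
        leftover = lines.pop()  # the last element may be an incomplete line
        for line in lines:
            yield line.decode("utf-8")  # assumes a UTF-8 encoded CSV
        start = end + 1
    if leftover:
        yield leftover.decode("utf-8")

# usage (placeholder names):
# service = BlockBlobService(account_name="myaccount", account_key="mykey")
# for line in iter_blob_lines(service, "test", "orderingai2.csv"):
#     print(line)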

Here is a good read on the topic

After reading other websites and modifying some of the code from the link above, I ended up with the following:

import io
import datetime
from azure.storage.blob import BlockBlobService

acc_name = 'myaccount'
acc_key = 'my key'
container = 'storeai'
blob = "orderingai2.csv"

block_blob_service = BlockBlobService(account_name=acc_name, account_key=acc_key)
props = block_blob_service.get_blob_properties(container, blob)
blob_size = int(props.properties.content_length)
index = 0
chunk_size = 104858  # ~0.1 MB per range request; don't make this too big or you will get a memory error
output = io.BytesIO()  # note: the same stream is reused, so everything downloaded so far stays in memory


def worker(data):
    # placeholder: process one downloaded chunk here
    print(data)


while index < blob_size:
    now_chunk = datetime.datetime.now()  # timestamp for timing each chunk
    # download only the byte range [index, index + chunk_size - 1] of the blob
    block_blob_service.get_blob_to_stream(container, blob, stream=output, start_range=index, end_range=index + chunk_size - 1, max_connections=50)
    if output is None:
        continue
    output.seek(index)    # jump to the start of the range that was just downloaded
    data = output.read()  # read everything from there to the end of the stream
    length = len(data)
    index += length
    if length > 0:
        worker(data)
        if length < chunk_size:  # a short read means the end of the blob was reached
            break
    else:
        break
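If you still need the original line-by-line parsing on top of this chunked loop, the worker could carry the trailing partial line of each chunk over to the next call. A rough sketch (the module-level pending buffer is a placeholder of mine, and it assumes a UTF-8 encoded CSV) that would replace the simple worker above:

pending = b""  # holds the incomplete last line from the previous chunk (placeholder)

def worker(data):
    global pending
    lines = (pending + data).split(b"\n")
    pending = lines.pop()  # the last element may be cut off mid-line
    for line in lines:
        print(line.decode("utf-8"))  # assumes UTF-8; replace print with your real parsing

Whatever is left in pending once the while loop finishes is the final line of the file and still needs to be processed.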
