
How can you process a CSV from Azure Blob Storage as a stream in Python

It is simple to get a StorageStreamDownloader using the azure.storage.blob package:

from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string("my azure connection string")
container_client = blob_service_client.get_container_client("my azure container name")
blob_client = container_client.get_blob_client("my azure file name")
storage_stream_downloader = blob_client.download_blob()

and it is simple to process a file-like object, or more precisely, I think, any string-returning iterator (or a file path), with the csv package:

import csv
from io import StringIO
 
csv_string = """col1, col2
a,b
c,d"""
with StringIO(csv_string) as csv_file:
  for row in csv.reader(csv_file):
    print(row) # or rather whatever I actually want to do on a row-by-row basis, e.g. ascertain that the file contains a row that meets a certain condition

What I'm struggling with is getting the streaming data from my StorageStreamDownloader into csv.reader() in such a way that I can process each line as it arrives rather than waiting for the whole file to download.

The Microsoft docs strike me as a little underwritten by their standards (the chunks() method has no annotation?), but I see there is a readinto() method for reading into a stream. I have tried reading into a BytesIO stream but cannot work out how to get the data out into csv.reader() without just outputting the buffer to a new file and reading that file. This all strikes me as a thing that should be doable, but I'm probably missing something obvious conceptually, perhaps to do with itertools or asyncio, or perhaps I'm just using the wrong csv tool for my needs?
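For concreteness, the readinto() attempt looks roughly like this (a sketch; the io.TextIOWrapper step is an assumed way of handing the bytes buffer to csv.reader, and readinto() only returns once the entire blob has been written into the buffer, which is exactly the whole-file wait I am trying to avoid):

import csv
import io

stream = blob_client.download_blob()
with io.BytesIO() as buf:
    # readinto() writes the entire blob into buf before returning,
    # so nothing below runs until the full download has finished
    stream.readinto(buf)
    buf.seek(0)
    # assumed bridge: wrap the bytes buffer in a text layer so that
    # csv.reader receives strings rather than bytes
    with io.TextIOWrapper(buf, encoding='utf-8') as text_buf:
        for row in csv.reader(text_buf):
            print(row)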

If you want to read the CSV file one row at a time, you can use the method pd.read_csv(filename, chunksize=1). For more details, please refer to the pandas read_csv documentation.

For example (I use pandas 1.2.1):

import io
import pandas as pd

# assumption: the blob downloaded into an in-memory buffer first
content = io.BytesIO(blob_client.download_blob().readall())

with pd.read_csv(content, chunksize=1) as reader:
    for chunk in reader:
        print(chunk)
        print('---------------')


Besides, if you want to use the method chunks(), you need to set max_chunk_get_size and max_single_get_size to the same value when you create the BlobClient. For more details, please refer to the azure-storage-blob documentation.

For example:

from azure.storage.blob import BlobClient

key = '<account_key>'

blob_client = BlobClient(account_url='https://andyprivate.blob.core.windows.net',
                         container_name='input',
                         blob_name='cities.csv',
                         credential=key,
                         max_chunk_get_size=1024,   # chunk size in bytes
                         max_single_get_size=1024)  # max size for a single-shot download
stream = blob_client.download_blob()

# each iteration downloads and yields one chunk of bytes
for chunk in stream.chunks():
    print(len(chunk))
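To turn those chunks into CSV rows rather than raw byte counts, you can wrap chunks() in a small generator that decodes the bytes and yields complete lines, since csv.reader accepts any iterator of strings. A minimal sketch (the iter_csv_lines helper is my own; it assumes the file has no newlines embedded inside quoted fields):

import codecs
import csv

def iter_csv_lines(downloader, encoding='utf-8'):
    # incremental decoder, so a multi-byte character split across
    # two chunks still decodes correctly
    decoder = codecs.getincrementaldecoder(encoding)()
    partial = ''
    for chunk in downloader.chunks():
        partial += decoder.decode(chunk)
        lines = partial.split('\n')
        partial = lines.pop()  # keep any trailing incomplete line
        yield from lines
    partial += decoder.decode(b'', final=True)
    if partial:
        yield partial

stream = blob_client.download_blob()
for row in csv.reader(iter_csv_lines(stream)):
    print(row)  # each row is available as soon as its chunk arrives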


Based on a comment by Jim Xu:

import io
import pandas as pd

stream = blob_client.download_blob()
with io.BytesIO() as buf:
    stream.readinto(buf)

    # reset the buffer, otherwise pandas won't read from the start
    buf.seek(0)

    data = pd.read_csv(buf)

or

csv_content = blob_client.download_blob().readall()
data = pd.read_csv(io.BytesIO(csv_content))
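And if you want to stay with the standard-library csv module from the question rather than pandas, the same downloaded bytes can be handed to csv.reader through a text wrapper (a sketch; note that readall() still downloads the whole blob before the first row is parsed):

import csv
import io

csv_bytes = blob_client.download_blob().readall()
with io.TextIOWrapper(io.BytesIO(csv_bytes), encoding='utf-8') as csv_file:
    for row in csv.reader(csv_file):
        print(row)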
