It is simple to get a StorageStreamDownloader using the azure.storage.blob package:
from azure.storage.blob import BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string("my azure connection string")
container_client = blob_service_client.get_container_client("my azure container name")
blob_client = container_client.get_blob_client("my azure file name")
storage_stream_downloader = blob_client.download_blob()
and it is simple to process a file-like object (or more specifically, I think, any string-returning iterator, or the path of a file) with the csv package:
import csv
from io import StringIO
csv_string = """col1, col2
a,b
c,d"""
with StringIO(csv_string) as csv_file:
    for row in csv.reader(csv_file):
        print(row)  # or rather whatever I actually want to do on a row-by-row basis, e.g. ascertain that the file contains a row that meets a certain condition
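In fact csv.reader accepts any iterator of strings, not just file objects, so for instance a plain generator works too:

import csv

def lines():
    # any string-yielding iterator can stand in for a file object
    yield "col1,col2"
    yield "a,b"
    yield "c,d"

for row in csv.reader(lines()):
    print(row)  # ['col1', 'col2'], then ['a', 'b'], then ['c', 'd']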
What I'm struggling with is getting the streaming data from my StorageStreamDownloader into csv.reader() in such a way that I can process each line as it arrives rather than waiting for the whole file to download.
The Microsoft docs strike me as a little underwritten by their standards (the chunks() method has no annotation?), but I see there is a readinto() method for reading into a stream. I have tried reading into a BytesIO stream but cannot work out how to get the data out into csv.reader() without just outputting the buffer to a new file and reading that file. This all strikes me as a thing that should be doable, but I'm probably missing something obvious conceptually, perhaps to do with itertools or asyncio, or perhaps I'm just using the wrong csv tool for my needs?
If you want to read a CSV file one row at a time, you can use pd.read_csv(filename, chunksize=1). For more details, please refer to the pandas documentation on chunked reading.
For example (I use pandas 1.2.1):
import pandas as pd

# content: a file path or file-like object holding the CSV data
with pd.read_csv(content, chunksize=1) as reader:
    for chunk in reader:  # each chunk is a DataFrame with a single row
        print(chunk)
        print('---------------')
Besides, if you want to use the chunks() method, you need to set max_chunk_get_size and max_single_get_size to the same value when you create the BlobClient. For more details, please refer to the azure-storage-blob documentation.
For example:
from azure.storage.blob import BlobClient

key = '<account_key>'
blob_client = BlobClient(account_url='https://andyprivate.blob.core.windows.net',
                         container_name='input',
                         blob_name='cities.csv',
                         credential=key,
                         max_chunk_get_size=1024,
                         max_single_get_size=1024)
stream = blob_client.download_blob()
for chunk in stream.chunks():
    print(len(chunk))
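To hand those chunks to the csv.reader from the question as they arrive, one option is to wrap the chunk iterator in a generator that re-splits the bytes into text lines. This is only a sketch: it assumes the blob is UTF-8 text and that no quoted field contains an embedded newline, and lines_from_chunks is a hypothetical helper, not part of the SDK:

import csv

def lines_from_chunks(chunks, encoding='utf-8'):
    # accumulate raw bytes until at least one full line is available,
    # yield the complete lines, and carry the partial tail forward
    pending = b''
    for chunk in chunks:
        pending += chunk
        *complete, pending = pending.split(b'\n')
        for line in complete:
            yield line.decode(encoding)
    if pending:  # trailing data with no final newline
        yield pending.decode(encoding)

stream = blob_client.download_blob()
for row in csv.reader(lines_from_chunks(stream.chunks())):
    print(row)  # each row is parsed as soon as its chunk has arrived

Because csv.reader accepts any iterator of strings, no intermediate file or full download is needed; rows become available while later chunks are still in flight.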
Based on a comment by Jim Xu:
import io
import pandas as pd

stream = blob_client.download_blob()
with io.BytesIO() as buf:
    stream.readinto(buf)
    # reset the buffer, otherwise pandas won't read from the start
    buf.seek(0)
    data = pd.read_csv(buf)
or
csv_content = blob_client.download_blob().readall()
data = pd.read_csv(io.BytesIO(csv_content))
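The fully-downloaded bytes also work with the plain csv module from the question, with no temporary file. A sketch, assuming the blob is UTF-8: io.TextIOWrapper turns the byte buffer into the text-mode file object csv.reader expects.

import csv
import io

csv_bytes = blob_client.download_blob().readall()
# newline='' lets the csv module do its own line-ending handling
with io.TextIOWrapper(io.BytesIO(csv_bytes), encoding='utf-8', newline='') as text_file:
    for row in csv.reader(text_file):
        print(row)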