
Dask: How to read CSV files into a DataFrame from Microsoft Azure Blob Storage

S3Fs provides a Pythonic file interface to S3; does Dask have a similar Pythonic interface to Azure Blob Storage? The Python SDKs for Azure Blob Storage provide ways to read and write blobs, but their interface requires downloading the file from the cloud to the local machine. I am looking for a solution that reads a blob as a stream or string, so that Dask can read it in parallel without persisting it to local disk.
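For context, the per-blob workflow with the Azure SDK might look roughly like the sketch below (a hypothetical example using the current azure-storage-blob package; the connection string, container, and blob names are placeholders, not from the question). It handles one blob at a time, which is exactly the pattern a Dask-level interface would generalize to parallel, glob-style reads.

import io

import pandas as pd
from azure.storage.blob import BlobServiceClient

# Placeholder credentials and names; each blob is fetched individually,
# with no Dask parallelism and no glob support across many CSVs.
service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="mycontainer", blob="path/to/data.csv")

# Pull the blob's bytes straight into memory and parse them with pandas.
csv_bytes = blob.download_blob().readall()
single_df = pd.read_csv(io.BytesIO(csv_bytes))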

I have just pushed new code here: https://github.com/dask/dask-adlfs

You can pip-install from that location, although you may be best served by conda-installing the requirements (dask, cffi, oauthlib) beforehand. In a Python session, doing import dask_adlfs is enough to register the backend with Dask, so that you can then use Azure URLs with Dask functions like:

import dask.dataframe as dd

# Credentials are passed as plain dict entries in storage_options;
# the values here are placeholders for your own Azure AD credentials.
df = dd.read_csv('adl://mystore/path/to/*.csv', storage_options={
    'tenant_id': 'mytenant', 'client_id': 'myclient',
    'client_secret': 'mysecret'})
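For completeness, the setup described above might look roughly like the sketch below. The exact install commands are an assumption about how you would pull in the requirements and the GitHub package; the import-registers-the-backend behavior is as stated above.

# Assumed one-time setup per environment (exact commands may vary):
#   conda install dask cffi oauthlib
#   pip install git+https://github.com/dask/dask-adlfs
#
# Importing the package is what registers the 'adl://' protocol with
# Dask; no further configuration call is needed.
import dask_adlfs  # noqa: F401  (imported for its registration side effect)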

Since this code is totally brand new and untested, expect rough edges. With luck, you can help iron out those edges.
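If the read succeeds, df is an ordinary Dask DataFrame, so a hypothetical continuation might be:

# df is lazy: these calls trigger the actual parallel read of the blobs.
print(df.head())       # parses only enough data to return the first few rows
row_count = len(df)    # forces a full pass over every matching CSV
print(df.describe().compute())  # summary statistics computed in parallel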

