dask: How to read CSV files into a DataFrame from Microsoft Azure Blob
S3Fs is a Pythonic file interface to S3; does Dask have any comparable Pythonic interface to Azure Storage Blob? The Python SDKs for Azure Storage Blob provide ways to read and write blobs, but the interface requires the file to be downloaded from the cloud to the local machine. I am looking for a solution that reads the blob, as either a stream or a string, in a way that supports Dask's parallel reads without persisting to local disk.
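For context, the single-blob workflow the question alludes to could look like the sketch below, assuming the azure-storage-blob v12 SDK; the connection string, container, and blob names are placeholders. It avoids persisting to local disk, but it still pulls one blob through one client, with none of the parallelism Dask offers:

import io
import pandas as pd
from azure.storage.blob import BlobServiceClient

# Placeholders: substitute a real connection string, container, and blob name.
service = BlobServiceClient.from_connection_string("my-connection-string")
blob = service.get_blob_client(container="mycontainer", blob="data.csv")
data = blob.download_blob().readall()  # raw bytes held in memory, no local file
df = pd.read_csv(io.BytesIO(data))     # parse the CSV straight from memory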
I have newly pushed code here: https://github.com/dask/dask-adlfs
You may pip-install from that location, although you may be best served by conda-installing the requirements (dask, cffi, oauthlib) beforehand. In a Python session, doing
import dask_adlfs
will be enough to register the backend with Dask, such that thereafter you can use Azure URLs with Dask functions like:
import dask.dataframe as dd
df = dd.read_csv('adl://mystore/path/to/*.csv', storage_options={
    'tenant_id': 'mytenant', 'client_id': 'myclient',
    'client_secret': 'mysecret'})
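As usual with Dask, read_csv above only builds a lazy task graph; the CSVs are fetched from the store, in parallel across partitions, when a result is actually computed. A minimal follow-on, where the column name 'amount' is a hypothetical example:

df.head()                             # reads just the first block
total = df['amount'].sum().compute()  # hypothetical column; triggers the parallel read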
Since this code is totally brand new and untested, expect rough edges. With luck, you can help iron out those edges.