
dask: How to read CSV files into a DataFrame from Microsoft Azure Blob

S3Fs is a Pythonic file interface to S3. Does Dask have a comparable Pythonic interface to Azure Storage Blob? The Python SDKs for Azure Storage Blob provide ways to read and write blobs, but the interface requires the file to be downloaded from the cloud to the local machine. I am looking for a solution that reads the blob as either a stream or a string, without persisting it to local disk, so that Dask can read in parallel.

I have just pushed code here: https://github.com/dask/dask-adlfs

You can pip-install from that location, although you may be best served by conda-installing the requirements (dask, cffi, oauthlib) beforehand. In a Python session, running import dask_adlfs is enough to register the backend with Dask, so that you can then use Azure URLs with Dask functions such as:

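import dask.dataframe as dd
import dask_adlfs  # importing registers the adl:// backend with Dask

# storage_options must be a dict, so use 'key': value pairs
df = dd.read_csv('adl://mystore/path/to/*.csv', storage_options={
    'tenant_id': 'mytenant', 'client_id': 'myclient',
    'client_secret': 'mysecret'})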
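
If you would rather not hard-code the credentials, one option is to build the storage_options dict from environment variables. The following is only a minimal sketch: the environment variable names and both helper functions (build_storage_options, read_adl_csv) are hypothetical and not part of dask-adlfs. The read_csv call simply mirrors the usage shown above.

```python
import os

# Hypothetical environment variable names (an assumption, not defined by
# dask-adlfs); adjust them to match your own setup.
ENV_KEYS = {
    "tenant_id": "AZURE_TENANT_ID",
    "client_id": "AZURE_CLIENT_ID",
    "client_secret": "AZURE_CLIENT_SECRET",
}


def build_storage_options(env=None):
    """Collect the three credentials into a dict suitable for storage_options."""
    env = os.environ if env is None else env
    # Report every missing or empty variable at once.
    missing = [name for name in ENV_KEYS.values() if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_KEYS.items()}


def read_adl_csv(pattern, env=None):
    """Read CSV files matching an adl:// glob into a Dask DataFrame."""
    # Imported lazily so build_storage_options works without Dask installed.
    import dask.dataframe as dd
    import dask_adlfs  # noqa: F401  (the import registers the backend)

    return dd.read_csv(pattern, storage_options=build_storage_options(env))
```

With the variables set, read_adl_csv('adl://mystore/path/to/*.csv') should behave like the call above, while keeping the secret out of the source file.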

Since this code is brand new and untested, expect rough edges. With luck, you can help iron them out.
