
Can't use wildcard with Azure Data Lake Gen2 files

I was able to connect my Data Lake Gen2 storage account to my Azure ML workspace. However, when I try to read a specific set of Parquet files from the datastore, the read takes forever and never finishes.

The code looks like:

from azureml.core import Workspace, Datastore, Dataset
from azureml.data.datapath import DataPath

# subscription_id, resource_group and workspace_name are defined elsewhere
ws = Workspace(subscription_id, resource_group, workspace_name)

datastore = Datastore.get(ws, 'my-datastore')

files_path = 'Brazil/CommandCenter/Invoices/dt_folder=2020-05-11/*.parquet'

dataset = Dataset.Tabular.from_parquet_files(path=[DataPath(datastore, files_path)], validate=False)

# Take the first 1000 rows and convert them to a pandas DataFrame
df = dataset.take(1000).to_pandas_dataframe()

Each of these Parquet files is approximately 300 kB, and there are 200 of them in the folder. They are generic files, straight out of Databricks. The strange part is that reading a single Parquet file from the exact same folder runs smoothly.

Other folders that contain fewer than about 20 files also load smoothly, so I ruled out a connectivity issue. Even stranger, I tried a narrower wildcard like the following:

# files_path = 'Brazil/CommandCenter/Invoices/dt_folder=2020-05-11/part-00000-*.parquet'

In theory this should match only the 00000 file, but it does not load either.

As a workaround, I connected to the Data Lake through adlfs with Dask, and that just works. I know this can be a workaround for processing "large" datasets/files, but it would be much nicer to do it directly with the Dataset class methods.
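For reference, the Dask workaround could look roughly like the sketch below. It is not the exact code I ran: the container name, the account name and key, and the helper names `build_abfs_url` and `read_parquet_with_dask` are placeholders made up for illustration, and it assumes `dask` and `adlfs` are installed and that account-key authentication is acceptable.

```python
def build_abfs_url(container, path):
    """Build an abfs:// URL for a path inside a Data Lake Gen2 container."""
    return f"abfs://{container}/{path.lstrip('/')}"


def read_parquet_with_dask(container, path, account_name, account_key):
    """Lazily read Parquet files (glob patterns allowed) into a Dask DataFrame."""
    # Imported here so the URL helper above works without dask installed.
    # adlfs must also be installed so that fsspec can resolve abfs:// URLs.
    import dask.dataframe as dd

    # Hypothetical credentials; other adlfs authentication options exist.
    storage_options = {"account_name": account_name, "account_key": account_key}
    return dd.read_parquet(
        build_abfs_url(container, path),
        storage_options=storage_options,
    )


if __name__ == "__main__":
    files_path = "Brazil/CommandCenter/Invoices/dt_folder=2020-05-11/*.parquet"
    # "my-container" is a placeholder, not the real container name.
    print(build_abfs_url("my-container", files_path))
```

Calling `read_parquet_with_dask(...).head(1000)` would then pull the first rows into pandas, comparable to `dataset.take(1000).to_pandas_dataframe()` above.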

Any thoughts?

EDIT: typo

Upgrading the following packages resolves the issue:

pip install --upgrade azureml-dataprep azureml-dataprep-rslex

According to some people at Microsoft, the fix will ship in the next azureml.core update.
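To confirm that the upgrade took effect in the environment your notebook actually uses, a quick check along these lines may help. This is only a sketch: the helper name `installed_versions` is made up for illustration, and I am not stating which version numbers contain the fix, since I was not given them.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI.
PACKAGES = ("azureml-core", "azureml-dataprep", "azureml-dataprep-rslex")


def installed_versions(packages=PACKAGES):
    """Return {package name: installed version, or None if not installed}."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it before and after the `pip install --upgrade` command and compare the output; restart the kernel after upgrading so the new versions are loaded.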
