
How to access captured data from Event Hub in Azure Data Lake Storage Gen2 using Python

I'm using a connection string to access an Azure Data Lake Storage Gen2 account in which Event Hubs Capture has stored a large number of Avro files, under the usual directory structure of folders named by year/month/day/hour/minute. I'm using the azure.storage.filedatalake package.
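For reference, the folder layout described above can be turned back into a timestamp once the file paths are known. The following is only a minimal sketch: it assumes the five folders directly above each file are year/month/day/hour/minute, as stated in the question, and the function name `capture_minute` and the sample path are made up for illustration.

```python
from datetime import datetime
from pathlib import PurePosixPath


def capture_minute(path_name):
    """Return the minute encoded in the five folders directly above the file.

    Assumes a layout of .../year/month/day/hour/minute/<file>, as described
    in the question. Anything before the year folder is ignored.
    """
    parts = PurePosixPath(path_name).parts
    if len(parts) < 6:
        raise ValueError("expected at least year/month/day/hour/minute/<file>")
    # The last component is the file itself; the five before it are the folders.
    year, month, day, hour, minute = (int(p) for p in parts[-6:-1])
    return datetime(year, month, day, hour, minute)


# Hypothetical path, used only to exercise the function.
example = capture_minute("myns/myhub/0/2020/03/05/14/30/15.avro")
print(example.isoformat())
```

A helper like this could be used to pick out the files from a given time window after listing the paths.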

Firstly I get a Data Lake service client using:

from azure.storage.filedatalake import DataLakeServiceClient
datalake_service_client = DataLakeServiceClient.from_connection_string(connection_string)

And then I get the file systems in the lake by:

file_systems = datalake_service_client.list_file_systems()
for file_system in file_systems:
    print(file_system.name)

There is only one file system in this case, called "datalake1". At this point I want to access all the Avro files I expect to find in it. I first get a file system client:

file_system_client = datalake_service_client.get_file_system_client("datalake1")

and then by using the get_paths method:

file_system_client.get_paths()

It returns an iterator (an azure.core.paging.ItemPaged object), but from there I can't see the folders and files. I tried a simple list comprehension, [x.name for x in file_system_client.get_paths()], but I got the error StorageErrorException: Operation returned an invalid status 'The specified container does not exist.'

Any idea about how to access the Avro files following this procedure?

EDIT: I'm using azure-storage-file-datalake version 12.0.0. Here is a screenshot of the code:

[screenshot not preserved]

Thanks

update:

Tested it with your code:

[screenshot not preserved]


original answer:

After you call the get_paths() method, you can use the is_directory property of each returned item to determine whether it is a directory or a file. If it is a file, you can then do something with it.

Sample code (in this sample I just print the .avro file path; feel free to modify the code to meet your needs):

# other code (client setup as in the question)
paths = file_system_client.get_paths()

for path in paths:
    # determine whether it is a directory or a file
    if not path.is_directory:
        # here, just print out the file name
        print(path.name + '\n')
        # you can do other operations here

The test result:

[screenshot not preserved]
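If only the Avro files are of interest, the is_directory check can be combined with a file extension check. This is a rough sketch rather than a tested recipe against a live account: `FakePath` is a stand-in that mimics only the `name` and `is_directory` attributes used above, and the helper name `avro_file_names` and the sample paths are invented for illustration.

```python
from collections import namedtuple

# Stand-in for the items yielded by get_paths(); it carries only the two
# attributes this sketch relies on (name and is_directory).
FakePath = namedtuple("FakePath", ["name", "is_directory"])


def avro_file_names(paths):
    """Yield the names of entries that are files and end in .avro."""
    for path in paths:
        if not path.is_directory and path.name.lower().endswith(".avro"):
            yield path.name


# Hypothetical listing, used only to exercise the helper.
sample = [
    FakePath("myhub/0/2020", True),
    FakePath("myhub/0/2020/03/05/14/30/15.avro", False),
    FakePath("myhub/0/notes.txt", False),
]
avro_names = list(avro_file_names(sample))
print(avro_names)
```

With a real client, the same helper should work as avro_file_names(file_system_client.get_paths()), since it only reads those two attributes.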

The problem was the connection string. I tried again with the connection string from the "Access keys" blade in the Azure portal, and now it works fine: get_paths() and the other calls run correctly. The previous connection string came from Storage Explorer and corresponds to the one shown in the "Shared access signature" blade. Credit to @MartinJaffer-MSFT (MSDN).
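A quick way to tell which kind of connection string you have is to look at its fields: the one from the "Access keys" blade carries an AccountKey field, while the one from the "Shared access signature" blade carries a SharedAccessSignature field. The sketch below only inspects the string and does not explain why the SAS-based one was rejected here. The function names and the sample strings (with placeholder secrets) are illustrative assumptions.

```python
def parse_connection_string(connection_string):
    """Split a 'Key=Value;Key=Value' connection string into a dict."""
    fields = {}
    for segment in connection_string.split(";"):
        if not segment.strip():
            continue
        # Split on the first '=' only: keys and SAS tokens contain '=' too.
        key, _, value = segment.partition("=")
        fields[key.strip()] = value
    return fields


def credential_kind(connection_string):
    """Return 'sas', 'account_key' or 'unknown' for a connection string."""
    fields = parse_connection_string(connection_string)
    if "SharedAccessSignature" in fields:
        return "sas"
    if "AccountKey" in fields:
        return "account_key"
    return "unknown"


# Placeholder values only; these are not real credentials.
key_style = (
    "DefaultEndpointsProtocol=https;AccountName=myaccount;"
    "AccountKey=PLACEHOLDER==;EndpointSuffix=core.windows.net"
)
sas_style = (
    "BlobEndpoint=https://myaccount.blob.core.windows.net/;"
    "SharedAccessSignature=sv=2019-02-02&ss=b&sig=PLACEHOLDER"
)
print(credential_kind(key_style), credential_kind(sas_style))
```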
