Question: On my local machine, I can get the line count of a data file with the following Python code. How can I do the same when the file is stored in a container, say myContainer, in Azure Data Lake Gen2 storage?
with open('PPPLoanHoldStatus_AprilData.txt', 'r') as fp:
    for count, line in enumerate(fp):
        pass
print('Total Lines', count + 1)
Remark: When I run the following code in a notebook in Azure Databricks, I get the error shown below:
with open('abfss://myContainer@myAzureDLGen2.dfs.core.windows.net/MyDataFile.txt', 'r') as fp:
    for count, line in enumerate(fp):
        pass
print('Total Lines', count + 1)
ERROR :
No such file or directory: 'abfss://myContainer@myAzureDLGen2.dfs.core.windows.net/MyDataFile.txt'
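For context, Python's built-in open() only understands paths on the driver's local filesystem, so it cannot resolve an abfss:// URI and raises "No such file or directory". Independently of how you authenticate, a line count can be computed by streaming the file's bytes and counting newlines. The helper below is a minimal sketch; the commented ADLS usage assumes the azure-storage-file-datalake package and a configured file_client, which are illustrative and not part of the original post:

```python
import io

def count_lines(stream, chunk_size=1 << 20):
    """Count lines in a binary stream chunk by chunk, without loading it all into memory."""
    count = 0
    last_chunk = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        count += chunk.count(b"\n")
        last_chunk = chunk
    # A final line with no trailing newline still counts as one line.
    if last_chunk and not last_chunk.endswith(b"\n"):
        count += 1
    return count

# Local demonstration; for ADLS Gen2 you could feed it chunks from the
# azure-storage-file-datalake SDK (hypothetical client, for illustration):
#   downloader = file_client.download_file()
#   total = sum(chunk.count(b"\n") for chunk in downloader.chunks())
print(count_lines(io.BytesIO(b"line1\nline2\nline3")))  # 3
```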
If you want to do it without mounting the storage, you can try Azure Data Lake Storage credential passthrough. This requires an Azure Databricks workspace on the Premium plan.
Step-1: Configure storage logging by running the Set-AzStorageServiceLoggingProperty command against the ADLS account.
Step-2: This can be done in one of two ways: enabling ADLS credential passthrough for a High Concurrency cluster, or enabling it for a Standard cluster.
High Concurrency cluster: enable the credential passthrough option when creating the cluster.
Standard cluster: select Standard as the cluster mode, enable passthrough, and choose the user to grant access to from the drop-down.
Either approach works.
After creating the cluster, you can access ADLS Gen2 files from a notebook attached to it using the abfss:// path.
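With such a cluster running, the line count from the question can be reproduced with Spark: spark.read.text yields one DataFrame row per line of the file, so counting rows counts lines. A minimal sketch, reusing the hypothetical container, account, and file names from the question:

```python
# On a Databricks cluster, `spark` (a SparkSession) is predefined in the notebook.
path = "abfss://myContainer@myAzureDLGen2.dfs.core.windows.net/MyDataFile.txt"
lines_df = spark.read.text(path)  # one row per line of the file
print("Total Lines", lines_df.count())
```

This also scales better than reading the file on the driver, since Spark splits the read across the cluster.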
NOTE:
Make sure you have the Storage Blob Data Contributor role on the ADLS account, and prefer creating a new cluster rather than reusing clusters that were previously set up with ADLS credentials.
Reference:
https://learn.microsoft.com/en-us/azure/databricks/security/credential-passthrough/adls-passthrough