
Why can't Databricks Python read from my Azure Data Lake Storage Gen1?

I am trying to read a file mydir/mycsv.csv from Azure Data Lake Storage Gen1 in a Databricks notebook, using the following syntax (inspired by the documentation):

configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": "123abc-1e42-31415-9265-12345678",
           "dfs.adls.oauth2.credential": dbutils.secrets.get(scope = "adla", key = "adlamaywork"),
           "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/123456abc-2718-aaaa-9999-42424242abc/oauth2/token"}

dbutils.fs.mount(
  source = "adl://myadls.azuredatalakestore.net/mydir",
  mount_point = "/mnt/adls",
  extra_configs = configs)

# note: .collect() would return a plain list of Rows, which has no
# .head()/.to_csv(); keep it as a DataFrame and go through pandas instead
post_processed = spark.read.csv("/mnt/adls/mycsv.csv")

post_processed.limit(10).toPandas().to_csv("/dbfs/processed.csv")

dbutils.fs.unmount("/mnt/adls")
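
(Aside: once the mount exists, re-running dbutils.fs.mount on the same mount point raises a "Directory already mounted" error. A minimal guard, reusing the names above, could look like this:)

# Sketch: skip the mount call when /mnt/adls is already mounted,
# so the cell can be re-run safely
if not any(m.mountPoint == "/mnt/adls" for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
      source = "adl://myadls.azuredatalakestore.net/mydir",
      mount_point = "/mnt/adls",
      extra_configs = configs)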

My client 123abc-1e42-31415-9265-12345678 has access to the Data Lake Store myadls, and I have created the secret with

databricks secrets put --scope adla --key adlamaywork
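
The secret can also be sanity-checked from inside the notebook (a minimal check; the scope and key names are the ones used above):

# Confirm the scope contains the expected key; secret values themselves
# are redacted when printed from a notebook
print([s.key for s in dbutils.secrets.list("adla")])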

When I execute the PySpark code above in a Databricks notebook, accessing the CSV file with spark.read.csv fails with

com.microsoft.azure.datalake.store.ADLException: Error getting info for file /mydir/mycsv.csv

When browsing DBFS with dbfs ls dbfs:/mnt/adls, the parent mount point seems to be there, but I get

Error: b'{"error_code":"IO_ERROR","message":"Error fetching access token\\nLast encountered exception thrown after 1 tries [HTTP0(null)]"}'
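
(One possible cause, though not certain from the error alone: a stale mount left behind by an earlier failed mount attempt keeps returning token errors until it is removed. The mount table can be inspected and cleared from the notebook:)

# Inspect existing mounts; a leftover /mnt/adls entry with bad credentials
# will keep failing until it is unmounted and recreated
for m in dbutils.fs.mounts():
    print(m.mountPoint, m.source)

dbutils.fs.unmount("/mnt/adls")  # then re-run the mount with fresh configs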

What am I doing wrong?

If you do not necessarily need to mount the directory into DBFS, you could try reading directly from ADLS, like this:

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.access.token.provider", "org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider")
spark.conf.set("dfs.adls.oauth2.client.id", "123abc-1e42-31415-9265-12345678")
spark.conf.set("dfs.adls.oauth2.credential", dbutils.secrets.get(scope = "adla", key = "adlamaywork"))
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/123456abc-2718-aaaa-9999-42424242abc/oauth2/token")

csvFile = "adl://myadls.azuredatalakestore.net/mydir/mycsv.csv"

df = spark.read.format('csv').options(header='true', inferSchema='true').load(csvFile)
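
From there, the question's final step of writing a small preview back to DBFS could look like this (a sketch; it assumes pandas is available on the cluster, as it is on standard Databricks runtimes):

# Take the first ten rows, convert to pandas, and write to the /dbfs
# local-file mirror, as in the question's processed.csv step
df.limit(10).toPandas().to_csv("/dbfs/processed.csv", index=False)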
