
Why can't Databricks Python read from my Azure Data Lake Storage Gen1?

I am trying to read a file mydir/mycsv.csv from Azure Data Lake Storage Gen1 in a Databricks notebook, using the following syntax (inspired by the documentation):

configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": "123abc-1e42-31415-9265-12345678",
           "dfs.adls.oauth2.credential": dbutils.secrets.get(scope = "adla", key = "adlamaywork"),
           "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/123456abc-2718-aaaa-9999-42424242abc/oauth2/token"}

dbutils.fs.mount(
  source = "adl://myadls.azuredatalakestore.net/mydir",
  mount_point = "/mnt/adls",
  extra_configs = configs)

# note: .collect() would return a plain list of Rows, which has no
# .head()/.to_csv(); keep it as a DataFrame and go through pandas instead
post_processed = spark.read.csv("/mnt/adls/mycsv.csv")

post_processed.limit(10).toPandas().to_csv("/dbfs/processed.csv")

dbutils.fs.unmount("/mnt/adls")
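
(Aside: once the mount exists, re-running dbutils.fs.mount on the same mount point raises a "Directory already mounted" error. A minimal guard, reusing the names above, could look like this:)

# Sketch: skip the mount call when /mnt/adls is already mounted,
# so the cell can be re-run safely
if not any(m.mountPoint == "/mnt/adls" for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
      source = "adl://myadls.azuredatalakestore.net/mydir",
      mount_point = "/mnt/adls",
      extra_configs = configs)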

My client 123abc-1e42-31415-9265-12345678 has access to the Data Lake Store myadls, and I have created the secret with

databricks secrets put --scope adla --key adlamaywork
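
The secret can also be sanity-checked from inside the notebook (a minimal check; the scope and key names are the ones used above):

# Confirm the scope contains the expected key; secret values themselves
# are redacted when printed from a notebook
print([s.key for s in dbutils.secrets.list("adla")])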

When I execute the PySpark code above in a Databricks notebook, accessing the CSV file with spark.read.csv fails with

com.microsoft.azure.datalake.store.ADLException: Error getting info for file /mydir/mycsv.csv

When browsing DBFS with dbfs ls dbfs:/mnt/adls, the parent mount point seems to be there, but I get

Error: b'{"error_code":"IO_ERROR","message":"Error fetching access token\\nLast encountered exception thrown after 1 tries [HTTP0(null)]"}'
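
(One possible cause, though not certain from the error alone: a stale mount left behind by an earlier failed mount attempt keeps returning token errors until it is removed. The mount table can be inspected and cleared from the notebook:)

# Inspect existing mounts; a leftover /mnt/adls entry with bad credentials
# will keep failing until it is unmounted and recreated
for m in dbutils.fs.mounts():
    print(m.mountPoint, m.source)

dbutils.fs.unmount("/mnt/adls")  # then re-run the mount with fresh configs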

What am I doing wrong?

If you do not necessarily need to mount the directory into DBFS, you could try reading directly from ADLS, like this:

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.access.token.provider", "org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider")
spark.conf.set("dfs.adls.oauth2.client.id", "123abc-1e42-31415-9265-12345678")
spark.conf.set("dfs.adls.oauth2.credential", dbutils.secrets.get(scope = "adla", key = "adlamaywork"))
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/123456abc-2718-aaaa-9999-42424242abc/oauth2/token")

csvFile = "adl://myadls.azuredatalakestore.net/mydir/mycsv.csv"

df = spark.read.format('csv').options(header='true', inferSchema='true').load(csvFile)
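
From there, the question's final step of writing a small preview back to DBFS could look like this (a sketch; it assumes pandas is available on the cluster, as it is on standard Databricks runtimes):

# Take the first ten rows, convert to pandas, and write to the /dbfs
# local-file mirror, as in the question's processed.csv step
df.limit(10).toPandas().to_csv("/dbfs/processed.csv", index=False)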
