
Can Spark write to Azure Data Lake Gen2?

It seems impossible to write to Azure Data Lake Gen2 using Spark, unless you're using Databricks.

I'm using Jupyter with almond to run Spark in a notebook locally.

I have imported the Hadoop dependencies:

import $ivy.`org.apache.hadoop:hadoop-azure:2.7.7`
import $ivy.`com.microsoft.azure:azure-storage:8.4.0` 

which allows me to use the wasbs:// protocol when trying to write my dataframe to Azure:

    spark.conf.set(
        "fs.azure.sas.[container].prodeumipsadatadump.blob.core.windows.net", 
        "?sv=2018-03-28&ss=b&srt=sco&sp=rwdlac&se=2019-09-09T23:33:45Z&st=2019-09-09T15:33:45Z&spr=https&sig=[truncated]")

This is where the error occurs:

val data = spark.read.json(spark.createDataset(
  """{"name":"Yin", "age": 25.35,"address":{"city":"Columbus","state":"Ohio"}}""" :: Nil))

data
  .write
  .orc("wasbs://[filesystem]@[datalakegen2storageaccount].blob.core.windows.net/lalalalala")

We are now greeted with a "Blob API is not yet supported for hierarchical namespace accounts" error:

org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Blob API is not yet supported for hierarchical namespace accounts.

So is this indeed impossible? Should I just abandon Data Lake Gen2 and use regular Blob Storage instead? Microsoft really dropped the ball in creating a "data lake" product while providing no documentation for a Spark connector.

Working with ADLS Gen2 in Spark is straightforward, and Microsoft haven't "dropped the ball" so much as "the Hadoop binaries shipped with ASF Spark don't include the ABFS client". Those in HDInsight, Cloudera CDH 6.x, etc. do. You need to:

  1. Consistently upgrade the hadoop-* JARs to Hadoop 3.2.1. That means all of them, not dropping in a later hadoop-azure-3.2.1 JAR and expecting things to work.
  2. Use abfs:// URLs.
  3. Configure the client as per the docs (a sketch of these steps follows below).
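
For concreteness, here is a minimal sketch of those steps in an almond notebook like the asker's. It assumes a Spark build compatible with Hadoop 3.2.1; the bracketed storage account, filesystem, and key values are placeholders, and account-key auth stands in for whichever auth scheme you actually use:

    // Upgrade to Hadoop 3.2.1 artifacts (all hadoop-* JARs, same version).
    import $ivy.`org.apache.hadoop:hadoop-common:3.2.1`
    import $ivy.`org.apache.hadoop:hadoop-azure:3.2.1`

    // ABFS client config: account-key auth is the simplest; OAuth also works.
    spark.conf.set(
      "fs.azure.account.key.[datalakegen2storageaccount].dfs.core.windows.net",
      "[account-key]")

    // Write with an abfs:// URL against the dfs endpoint,
    // not wasbs:// against the blob endpoint.
    data
      .write
      .orc("abfs://[filesystem]@[datalakegen2storageaccount].dfs.core.windows.net/lalalalala")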

ADLS Gen2 is the best object store Microsoft have deployed: with hierarchical namespaces you get O(1) directory operations, which for Spark means high-performance task and job commits. Security and permissions are great too.

Yes, it is unfortunate that it doesn't work out of the box with the Spark distribution you have, but Microsoft are not in a position to retrofit a new connector onto a set of artifacts released in 2017. You're going to have to upgrade your dependencies.

I think you have to enable the preview feature to use the Blob API with Azure Data Lake Gen2: Data Lake Gen2 Multi-Protocol Access.

Another thing that I found: the endpoint format needs to be updated by exchanging "blob" for "dfs". See here. But I am not sure if that helps with your problem.
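
To illustrate that swap, here is the question's write target in both forms; this is only a sketch of the URL change, since the abfs scheme additionally requires the ABFS driver discussed below:

    // Blob endpoint (fails for hierarchical-namespace accounts):
    //   wasbs://[filesystem]@[datalakegen2storageaccount].blob.core.windows.net/lalalalala
    // DFS endpoint (used by the ABFS driver):
    //   abfs://[filesystem]@[datalakegen2storageaccount].dfs.core.windows.net/lalalalala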

On the other hand, you could use the ABFS driver to access the data. This is not officially supported, but you could start from a Hadoop-free Spark distribution and install a newer Hadoop version containing the driver. I think this might be an option depending on your scenario: Adding hadoop ABFS driver to spark distribution.
