Can Spark write to Azure Datalake Gen2?

It seems impossible to write to Azure Datalake Gen2 using Spark, unless you're using Databricks.

I'm using Jupyter with almond to run Spark in a notebook locally.

I have imported the Hadoop dependencies:

import $ivy.`org.apache.hadoop:hadoop-azure:2.7.7`
import $ivy.`com.microsoft.azure:azure-storage:8.4.0` 

which allows me to use the wasbs:// protocol when trying to write my dataframe to Azure:

    spark.conf.set(
        "fs.azure.sas.[container].prodeumipsadatadump.blob.core.windows.net", 
        "?sv=2018-03-28&ss=b&srt=sco&sp=rwdlac&se=2019-09-09T23:33:45Z&st=2019-09-09T15:33:45Z&spr=https&sig=[truncated]")

This is where the error occurs:

val data = spark.read.json(spark.createDataset(
  """{"name":"Yin", "age": 25.35,"address":{"city":"Columbus","state":"Ohio"}}""" :: Nil))

data
  .write
  .orc("wasbs://[filesystem]@[datalakegen2storageaccount].blob.core.windows.net/lalalalala")

We are now greeted with the "Blob API is not yet supported for hierarchical namespace accounts" error:

org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Blob API is not yet supported for hierarchical namespace accounts.

So is this indeed impossible? Should I just abandon Datalake Gen2 and use regular blob storage? Microsoft really dropped the ball in creating a "Data lake" product while providing no documentation for a Spark connector.

Working with ADLS Gen2 in Spark is straightforward, and Microsoft haven't "dropped the ball" so much as "the Hadoop binaries shipped with ASF Spark don't include the ABFS client". Those in HDInsight, Cloudera CDH6.x etc. do.

  1. Consistently upgrade the hadoop-* JARs to Hadoop 3.2.1. That means all of them, not dropping in a later hadoop-azure-3.2.1 JAR and expecting things to work.
  2. Use abfs:// URLs.
  3. Configure the client as per the docs (see the sketch after this list).
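
A minimal sketch of steps 2 and 3, assuming shared-key auth and a hypothetical storage account mydatalake with container mycontainer (OAuth and SAS setups are covered in the same docs):

// Shared-key auth for the ABFS client; the account name and key are placeholders.
spark.conf.set(
  "fs.azure.account.key.mydatalake.dfs.core.windows.net",
  "<storage-account-access-key>")

// Note the abfs:// scheme and the dfs (not blob) endpoint.
data
  .write
  .orc("abfs://mycontainer@mydatalake.dfs.core.windows.net/lalalalala")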

ADLS Gen2 is the best object store Microsoft have deployed - with hierarchical namespaces you get O(1) directory operations, which for Spark means high-performance task and job commits. Security and permissions are great too.

Yes, it is unfortunate that it doesn't work out of the box with the Spark distribution you have - but Microsoft are not in a position to retrofit a new connector to a set of artifacts released in 2017. You're going to have to upgrade your dependencies.

I think you have to enable the preview feature to use the Blob API with Azure DataLake Gen2: Data Lake Gen2 Multi-Protocol-Access

Another thing that I found: the endpoint format needs to be updated by swapping "blob" for "dfs". See here. But I am not sure if that helps with your problem.
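
Applied to the write from the question, that would look roughly like this (same redacted placeholders, and assuming the ABFS driver from the answer above is on the classpath - abfss is its TLS variant):

// "blob" exchanged for "dfs" in the endpoint, with the abfss:// scheme:
data
  .write
  .orc("abfss://[filesystem]@[datalakegen2storageaccount].dfs.core.windows.net/lalalalala")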

On the other hand, you could use the ABFS driver to access the data. This is not officially supported, but you could start from a Hadoop-free Spark build and install a newer Hadoop version containing the driver. I think this might be an option depending on your scenario: Adding hadoop ABFS driver to spark distribution
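
In the almond notebook from the question, one untested possibility along those lines is to replace the Hadoop 2.7.7 imports with a consistent Hadoop 3.x set that ships the ABFS client (the versions here are assumptions, and this only works cleanly if the Spark build itself doesn't pin older hadoop-* JARs):

// Replace the 2.7.7 imports above with a matching Hadoop 3.2.1 set (assumed versions):
import $ivy.`org.apache.hadoop:hadoop-client:3.2.1`
import $ivy.`org.apache.hadoop:hadoop-azure:3.2.1`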
