Can Spark write to Azure Datalake Gen2?

It seems impossible to write to Azure Datalake Gen2 using Spark, unless you're using Databricks.

I'm using Jupyter with almond to run Spark in a notebook locally.

I have imported the Hadoop dependencies:

import $ivy.`org.apache.hadoop:hadoop-azure:2.7.7`
import $ivy.`com.microsoft.azure:azure-storage:8.4.0` 

which allows me to use the wasbs:// protocol when trying to write my dataframe to Azure:

    spark.conf.set(
        "fs.azure.sas.[container].prodeumipsadatadump.blob.core.windows.net", 
        "?sv=2018-03-28&ss=b&srt=sco&sp=rwdlac&se=2019-09-09T23:33:45Z&st=2019-09-09T15:33:45Z&spr=https&sig=[truncated]")

This is where the error occurs:

val data = spark.read.json(spark.createDataset(
  """{"name":"Yin", "age": 25.35,"address":{"city":"Columbus","state":"Ohio"}}""" :: Nil))

data
  .write
  .orc("wasbs://[filesystem]@[datalakegen2storageaccount].blob.core.windows.net/lalalalala")

We are now greeted with the "Blob API is not yet supported for hierarchical namespace accounts" error:

org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Blob API is not yet supported for hierarchical namespace accounts.

So is this indeed impossible? Should I just abandon Datalake Gen2 and use regular blob storage? Microsoft really dropped the ball in creating a "Data lake" product while providing no documentation for a Spark connector.

Working with ADLS Gen2 in Spark is straightforward, and Microsoft haven't "dropped the ball" so much as "the Hadoop binaries shipped with ASF Spark don't include the ABFS client". Those in HDInsight, Cloudera CDH6.x etc. do.

  1. Consistently upgrade the hadoop-* JARs to Hadoop 3.2.1. That means all of them, not dropping in a later hadoop-azure-3.2.1 JAR and expecting things to work.
  2. Use abfs:// URLs.
  3. Configure the client as per the docs (see the sketch after this list).
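
A minimal sketch of steps 2 and 3, assuming shared-key auth and a hypothetical storage account mydatalake with container mycontainer (OAuth and SAS setups are covered in the same docs):

// Shared-key auth for the ABFS client; the account name and key are placeholders.
spark.conf.set(
  "fs.azure.account.key.mydatalake.dfs.core.windows.net",
  "<storage-account-access-key>")

// Note the abfs:// scheme and the dfs (not blob) endpoint.
data
  .write
  .orc("abfs://mycontainer@mydatalake.dfs.core.windows.net/lalalalala")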

ADLS Gen2 is the best object store Microsoft have deployed - with hierarchical namespaces you get O(1) directory operations, which for Spark means high-performance task and job commits. Security and permissions are great too.

Yes, it is unfortunate that it doesn't work out of the box with the Spark distribution you have - but Microsoft are not in a position to retrofit a new connector to a set of artifacts released in 2017. You're going to have to upgrade your dependencies.

I think you have to enable the preview feature to use the Blob API with Azure DataLake Gen2: Data Lake Gen2 Multi-Protocol-Access

Another thing that I found: the endpoint format needs to be updated by swapping "blob" for "dfs". See here. But I am not sure if that helps with your problem.
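
Applied to the write from the question, that would look roughly like this (same redacted placeholders, and assuming the ABFS driver from the answer above is on the classpath - abfss is its TLS variant):

// "blob" exchanged for "dfs" in the endpoint, with the abfss:// scheme:
data
  .write
  .orc("abfss://[filesystem]@[datalakegen2storageaccount].dfs.core.windows.net/lalalalala")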

On the other hand, you could use the ABFS driver to access the data. This is not officially supported, but you could start from a Hadoop-free Spark build and install a newer Hadoop version containing the driver. I think this might be an option depending on your scenario: Adding hadoop ABFS driver to spark distribution
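
In the almond notebook from the question, one untested possibility along those lines is to replace the Hadoop 2.7.7 imports with a consistent Hadoop 3.x set that ships the ABFS client (the versions here are assumptions, and this only works cleanly if the Spark build itself doesn't pin older hadoop-* JARs):

// Replace the 2.7.7 imports above with a matching Hadoop 3.2.1 set (assumed versions):
import $ivy.`org.apache.hadoop:hadoop-client:3.2.1`
import $ivy.`org.apache.hadoop:hadoop-azure:3.2.1`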
