Connect to Azure Data Lake Gen 2 from local Spark job
I'm trying to connect from a local Spark job to my ADLS Gen 2 data lake to read some Databricks delta tables, which I've previously stored through a Databricks Notebook, but I'm getting a very weird exception that I can't sort out:
Exception in thread "main" java.io.IOException: There is no primary group for UGI <xxx> (auth:SIMPLE)
at org.apache.hadoop.security.UserGroupInformation.getPrimaryGroupName(UserGroupInformation.java:1455)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:136)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:108)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.delta.DeltaTableUtils$.findDeltaTableRoot(DeltaTable.scala:94)
Searching around, I haven't found many hints on this. One thing I tried was passing the config "spark.hadoop.hive.server2.enable.doAs" = "false", but it didn't help.
I'm using io.delta 0.3.0, Spark 2.4.2_2.12 and hadoop-azure 3.2.0. I can connect to my Gen 2 account without issues through an Azure Databricks Cluster/Notebook.
I'm using code like the following:
try (final SparkSession spark = SparkSession.builder()
        .appName("DeltaLake")
        .master("local[*]")
        .getOrCreate()) {
    //spark.conf().set("spark.hadoop.hive.server2.enable.doAs", "false");
    spark.conf().set("fs.azure.account.key.stratify.dfs.core.windows.net", "my gen 2 key");
    spark.read().format("delta").load("abfss://myfs@myaccount.dfs.core.windows.net/Test");
}
ADLS Gen2 requires Hadoop 3.2, Spark 3.0.0, and Delta Lake 0.7.0. The requirements are documented at https://docs.delta.io/latest/delta-storage.html#azure-data-lake-storage-gen2
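For reference, a minimal sketch of a local job on the upgraded stack might look like the following. It assumes Spark 3.0.0 (the Hadoop 3.2 build), io.delta:delta-core_2.12:0.7.0 and org.apache.hadoop:hadoop-azure:3.2.0 on the classpath; the account name, container, and key are placeholders, and the two spark.sql.* settings are the ones Delta Lake 0.7.0's documentation gives for Spark 3.0:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaOnAdlsGen2 {
    public static void main(String[] args) {
        try (final SparkSession spark = SparkSession.builder()
                .appName("DeltaLake")
                .master("local[*]")
                // Settings documented by Delta Lake 0.7.0 for Spark 3.0:
                .config("spark.sql.extensions",
                        "io.delta.sql.DeltaSparkSessionExtension")
                .config("spark.sql.catalog.spark_catalog",
                        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                .getOrCreate()) {
            // "myaccount", "myfs" and the key are placeholders for your
            // storage account, filesystem (container) and access key.
            spark.conf().set(
                    "fs.azure.account.key.myaccount.dfs.core.windows.net",
                    "<storage-account-access-key>");
            Dataset<Row> df = spark.read().format("delta")
                    .load("abfss://myfs@myaccount.dfs.core.windows.net/Test");
            df.show();
        }
    }
}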
The ADLS Gen2 Hadoop connector is only available in Hadoop 3.2.0, and Spark 3.0.0 is the first Spark version that supports Hadoop 3.2. Databricks Runtime 6.x and older versions run Hadoop 2.7 and Spark 2.4, but the ADLS Gen2 Hadoop connector has been backported to this old Hadoop version internally. That's why Delta Lake can work in Databricks without upgrading to Spark 3.0.0.