
Connect to Azure Data Lake Gen 2 from local Spark job

I'm trying to connect from a local Spark job to my ADLS Gen 2 data lake to read some Databricks Delta tables, which I've previously stored through a Databricks Notebook, but I'm getting a very strange exception that I can't sort out:

Exception in thread "main" java.io.IOException: There is no primary group for UGI <xxx> (auth:SIMPLE)
    at org.apache.hadoop.security.UserGroupInformation.getPrimaryGroupName(UserGroupInformation.java:1455)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:136)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:108)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
    at org.apache.spark.sql.delta.DeltaTableUtils$.findDeltaTableRoot(DeltaTable.scala:94)

Searching around, I haven't found many hints on this. One suggestion I tried was to set the config "spark.hadoop.hive.server2.enable.doAs" to "false", but it didn't help.

I'm using io.delta 0.3.0, Spark 2.4.2 (Scala 2.12), and hadoop-azure 3.2.0. I can connect to my Gen 2 account without issues through an Azure Databricks cluster/notebook.

I'm using code like the following:

 try (final SparkSession spark = SparkSession.builder().appName("DeltaLake").master("local[*]").getOrCreate()) {
     //spark.conf().set("spark.hadoop.hive.server2.enable.doAs", "false");
     // The account name in the key setting must match the account in the abfss:// URI.
     spark.conf().set("fs.azure.account.key.myaccount.dfs.core.windows.net", "my gen 2 key");
     spark.read().format("delta").load("abfss://myfs@myaccount.dfs.core.windows.net/Test");
 }

ADLS Gen2 requires Hadoop 3.2, Spark 3.0.0, and Delta Lake 0.7.0. The requirements are documented at https://docs.delta.io/latest/delta-storage.html#azure-data-lake-storage-gen2

The ADLS Gen2 Hadoop connector is only available in Hadoop 3.2.0, and Spark 3.0.0 is the first Spark version that supports Hadoop 3.2.

Databricks Runtime 6.x and older versions run Hadoop 2.7 and Spark 2.4, but the ADLS Gen2 Hadoop connector is backported to this old Hadoop version internally. That's why Delta Lake can work in Databricks without upgrading to Spark 3.0.0.
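
A minimal sketch of what the local job could look like on the supported versions, assuming Spark 3.0.0 (Scala 2.12), io.delta:delta-core_2.12:0.7.0, and org.apache.hadoop:hadoop-azure:3.2.0 on the classpath; the account, filesystem, and key values are placeholders carried over from the question:

 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SparkSession;

 public class DeltaLakeLocal {
     public static void main(String[] args) {
         // Delta Lake 0.7.0 on Spark 3.0.0 is wired up via these two settings
         // (per the Delta Lake docs linked above); they are static confs, so
         // they must be set before the session is created.
         try (SparkSession spark = SparkSession.builder()
                 .appName("DeltaLake")
                 .master("local[*]")
                 .config("spark.sql.extensions",
                         "io.delta.sql.DeltaSparkSessionExtension")
                 .config("spark.sql.catalog.spark_catalog",
                         "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                 .getOrCreate()) {
             // Placeholder account/key, as in the question; the account name
             // in the key setting must match the account in the abfss:// URI.
             spark.conf().set("fs.azure.account.key.myaccount.dfs.core.windows.net",
                     "my gen 2 key");
             Dataset<Row> df = spark.read().format("delta")
                     .load("abfss://myfs@myaccount.dfs.core.windows.net/Test");
             df.show();
         }
     }
 }

Aside from the version upgrade and the two Delta extension/catalog settings, the ABFS account-key configuration is the same as in the original snippet.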
