Connect to Azure Data Lake Gen 2 from local Spark job
I'm trying to connect from a local Spark job to my ADLS Gen 2 data lake to read some Databricks delta tables, which I've previously stored through a Databricks Notebook, but I'm getting a very weird exception that I can't sort out:
Exception in thread "main" java.io.IOException: There is no primary group for UGI <xxx> (auth:SIMPLE)
at org.apache.hadoop.security.UserGroupInformation.getPrimaryGroupName(UserGroupInformation.java:1455)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:136)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:108)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.delta.DeltaTableUtils$.findDeltaTableRoot(DeltaTable.scala:94)
Searching around, I haven't found many hints on this. One thing I tried was passing the config "spark.hadoop.hive.server2.enable.doAs" = "false", but it didn't help.
I'm using io.delta 0.3.0, Spark 2.4.2_2.12 and hadoop-azure 3.2.0. I can connect to my Gen 2 account without issues through an Azure Databricks Cluster/Notebook.
I'm using code like the following:
try (final SparkSession spark = SparkSession.builder()
        .appName("DeltaLake")
        .master("local[*]")
        .getOrCreate()) {
    //spark.conf().set("spark.hadoop.hive.server2.enable.doAs", "false");
    spark.conf().set("fs.azure.account.key.stratify.dfs.core.windows.net", "my gen 2 key");
    spark.read().format("delta").load("abfss://myfs@myaccount.dfs.core.windows.net/Test");
}
ADLS Gen2 requires Hadoop 3.2, Spark 3.0.0, and Delta Lake 0.7.0. The requirements are documented at https://docs.delta.io/latest/delta-storage.html#azure-data-lake-storage-gen2
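For reference, a minimal sketch of a local job on the upgraded stack might look like the following. It assumes Spark 3.0.0 (the Hadoop 3.2 build), io.delta:delta-core_2.12:0.7.0 and org.apache.hadoop:hadoop-azure:3.2.0 on the classpath; the account name, container, and key are placeholders, and the two spark.sql.* settings are the ones Delta Lake 0.7.0's documentation gives for Spark 3.0:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaOnAdlsGen2 {
    public static void main(String[] args) {
        try (final SparkSession spark = SparkSession.builder()
                .appName("DeltaLake")
                .master("local[*]")
                // Settings documented by Delta Lake 0.7.0 for Spark 3.0:
                .config("spark.sql.extensions",
                        "io.delta.sql.DeltaSparkSessionExtension")
                .config("spark.sql.catalog.spark_catalog",
                        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                .getOrCreate()) {
            // "myaccount", "myfs" and the key are placeholders for your
            // storage account, filesystem (container) and access key.
            spark.conf().set(
                    "fs.azure.account.key.myaccount.dfs.core.windows.net",
                    "<storage-account-access-key>");
            Dataset<Row> df = spark.read().format("delta")
                    .load("abfss://myfs@myaccount.dfs.core.windows.net/Test");
            df.show();
        }
    }
}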
The ADLS Gen2 Hadoop connector is only available in Hadoop 3.2.0, and Spark 3.0.0 is the first Spark version that supports Hadoop 3.2. Databricks Runtime 6.x and older versions run Hadoop 2.7 and Spark 2.4, but the ADLS Gen2 Hadoop connector has been backported to this old Hadoop version internally. That's why Delta Lake can work in Databricks without upgrading to Spark 3.0.0.