
AzureBlobFileSystem FileNotFoundException when streaming from a Delta table on ADLS gen2

When I stream data from a Delta table hosted on Azure Data Lake Storage (ADLS) Gen2, the stream works for a short while before failing with the error below. The error says the path doesn't exist, but the storage logs show files being successfully written to and read from that path both before and after the error. It seems safe to say that the path does exist in Azure Storage, despite the exception.

For context:

  • I am using Spark 3.1 (pySpark).
  • I have a separate stream actively writing data to the delta table via a ForeachBatch sink (a sketch of that writer follows this list).
  • The delta table is a managed table.
  • This happens both when the input and output streams run on the same cluster and when they run on separate clusters.
  • I am using Azure Synapse.
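
For reference, a minimal sketch of the writing stream described above; the table name, the input source, the batch function, and the checkpoint path are placeholders, not the actual values from my job:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def write_batch(batch_df, batch_id):
        # The real job does more work per micro-batch; a plain append
        # to the managed Delta table keeps the sketch small.
        batch_df.write.format("delta").mode("append").saveAsTable("my_table")

    (spark.readStream
        .format("rate")  # stand-in for the real input source
        .load()
        .writeStream
        .foreachBatch(write_batch)
        .option("checkpointLocation", "/tmp/checkpoints/writer")
        .start())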

Fixes I've tried:

  1. Increasing the batch execution interval from None to 10 seconds (see the sketch after this list). After this, the query went from failing with the error below after ~15 minutes to failing after a little over an hour.
  2. Switching to a premium-tier ADLS account (no effect).
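
The trigger change in fix (1) looks like this on the reading stream's writer; names are placeholders and "noop" stands in for my real sink:

    # Before: no trigger specified, so micro-batches run back-to-back.
    # After: a fixed 10-second processing-time trigger.
    query = (spark.readStream
        .table("my_table")
        .writeStream
        .format("noop")  # stand-in sink for the sketch
        .trigger(processingTime="10 seconds")
        .option("checkpointLocation", "/tmp/checkpoints/reader")
        .start())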

I found one other report of this error, but no solution was provided, since it was raised with the wrong audience: https://github.com/delta-io/delta/issues/932 . Based on that issue, it seems a simple repro can be made just by reading and writing Spark streams against a delta table hosted on ADLS gen2.

How can I pin down the root cause? Are there any Spark or ADLS settings I can change to mitigate this?
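
One way I can think of to get more detail while debugging is to raise the ABFS client's log level from pySpark; a sketch (Spark 3.1 ships log4j 1.x, and the logger name is taken from the stack trace below):

    # Turn on DEBUG logging for the hadoop-azure (ABFS) client so each
    # request/response, including the driver's own retries, shows up in
    # the driver logs. The RequestId in the exception message can then
    # be matched against the storage account's diagnostic logs.
    log4j = spark.sparkContext._jvm.org.apache.log4j
    log4j.LogManager.getLogger("org.apache.hadoop.fs.azurebfs").setLevel(log4j.Level.DEBUG)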

22/03/19 02:06:20 ERROR MicroBatchExecution: Query [id = 00f1d866-74a2-42f9-8fb6-c8d1a76e00a6, runId = 902f8480-4dc6-4a7d-aada-bfe3b660d288] terminated with error
java.io.FileNotFoundException: Operation failed: "The specified path does not exist.", 404, GET, https://example.dfs.core.windows.net/workspace?upn=false&resource=filesystem&maxResults=5000&directory=synapse/workspaces/example/warehouse/my_table/_delta_log&timeout=90&recursive=false, PathNotFound, "The specified path does not exist. RequestId:e8c6fb8e-101f-00cb-5c35-3b717e000000 Time:2022-03-19T02:06:20.8352277Z"
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1178)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:408)
    at org.apache.spark.sql.delta.storage.HadoopFileSystemLogStore.listFrom(HadoopFileSystemLogStore.scala:69)
    at org.apache.spark.sql.delta.DeltaLog.getChanges(DeltaLog.scala:227)
    at org.apache.spark.sql.delta.sources.DeltaSource.filterAndIndexDeltaLogs$1(DeltaSource.scala:190)
    at org.apache.spark.sql.delta.sources.DeltaSource.getFileChanges(DeltaSource.scala:203)
    at org.apache.spark.sql.delta.sources.DeltaSourceBase.getFileChangesAndCreateDataFrame(DeltaSource.scala:117)
    at org.apache.spark.sql.delta.sources.DeltaSourceBase.getFileChangesAndCreateDataFrame$(DeltaSource.scala:112)
    at org.apache.spark.sql.delta.sources.DeltaSource.getFileChangesAndCreateDataFrame(DeltaSource.scala:144)
    at org.apache.spark.sql.delta.sources.DeltaSource.getBatch(DeltaSource.scala:385)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$3(MicroBatchExecution.scala:486)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:27)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
    at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:27)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$2(MicroBatchExecution.scala:482)
    at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
    at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:482)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:226)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
    at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194)
    at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188)
    at org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:334)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
    at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:317)
    at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244)
Caused by: Operation failed: "The specified path does not exist.", 404, GET, https://example.dfs.core.windows.net/workspace?upn=false&resource=filesystem&maxResults=5000&directory=synapse/workspaces/example/warehouse/my_table/_delta_log&timeout=90&recursive=false, PathNotFound, "The specified path does not exist. RequestId:e8c6fb8e-101f-00cb-5c35-3b717e000000 Time:2022-03-19T02:06:20.8352277Z"
    at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:207)
    at org.apache.hadoop.fs.azurebfs.services.AbfsClient.listPath(AbfsClient.java:231)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:905)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:876)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:858)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:405)
    ... 37 more

Please check the points below:

  • Make sure you have the required permissions. You will need the Storage Blob Data Contributor role on the storage account (contributor access to the data lake and the container) for this to work.
  • Try restarting the cluster after clearing the cache.
  • Recreating the cluster can be another troubleshooting step to ensure everything is set up correctly.
  • The cause can also be a network issue on the Azure side: transient faults (a retry-tuning sketch follows this list).
  • Try changing the container's access level to Anonymous access.
  • Also double-check the path in HDFS.
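
On the transient-faults point: the ABFS driver retries failed requests on its own, and its retry policy is tunable through Hadoop configuration. A sketch, assuming the hadoop-azure configuration keys below (verify them against your Hadoop version; the values are illustrative, not recommendations):

    # Set these before the filesystem is first used, since FileSystem
    # instances are cached with the configuration they were created with.
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set("fs.azure.io.retry.max.retries", "30")
    hconf.set("fs.azure.io.retry.min.backoff.interval", "3000")   # milliseconds
    hconf.set("fs.azure.io.retry.max.backoff.interval", "90000")  # milliseconds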

Otherwise, the problem can be caused by multiple jobs writing to the same table, where one job is cleaning up while another is setting up, and the two get mixed up (see the retention sketch below).
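
If that is what is happening, one mitigation to try (an assumption, not a confirmed fix) is to lengthen the Delta table's retention windows so that log and data files outlive any lagging reader; "my_table" is a placeholder:

    # Keep delta log entries and vacuumed data files around longer so a
    # concurrent reader does not list files that cleanup just removed.
    spark.sql("""
        ALTER TABLE my_table SET TBLPROPERTIES (
            'delta.logRetentionDuration' = 'interval 30 days',
            'delta.deletedFileRetentionDuration' = 'interval 7 days'
        )
    """)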

Note

  1. To avoid the error to some extent, make sure your jobs are not writing to the same table simultaneously.
  2. Use the most recent version of Spark that you can.

References:

  1. Databricks error while trying to create delta table on ADLS Gen2 - Stack Overflow
  2. azure databricks - the Dataset/DataFrame - Stack Overflow
  3. python - FileNotFoundException - Stack Overflow
