
Spark worker throws FileNotFoundException on temporary shuffle files

I am running a Spark application that processes multiple sets of data points; some of these sets need to be processed sequentially. When running the application for small sets of data points (ca. 100), everything works fine. But in some cases, the sets will have a size of ca. 10,000 data points, and those cause the worker to crash with the following stack trace:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 4 times, most recent failure: Lost task 0.3 in stage 26.0 (TID 36, 10.40.98.10, executor 1): java.io.FileNotFoundException: /tmp/spark-5198d746-6501-4c4d-bb1c-82479d5fd48f/executor-a1d76cc1-a3eb-4147-b73b-29742cfd652d/blockmgr-d2c5371b-1860-4d8b-89ce-0b60a79fa394/3a/temp_shuffle_94d136c9-4dc4-439e-90bc-58b18742011c (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:102)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:115)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:235)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I have checked all log files after multiple instances of this error, but did not find any other error messages.

Searching the internet for this problem, I have found two potential causes that do not seem to be applicable to my situation:

  • The user running the Spark process does not have read/write permission in the /tmp/ directory.
    • Seeing as the error occurs only for larger datasets (instead of always), I do not expect this to be the problem.
  • The /tmp/ directory does not have enough space for shuffle files (or other temporary Spark files).
    • The /tmp/ directory on my system has about 45GB available, and the amount of data in a single data point (< 1KB) means that this is also probably not the case.

I have been flailing at this problem for a couple of hours, trying to find work-arounds and possible causes.

  • I have tried reducing the cluster (which is normally two machines) to a single worker, running on the same machine as the driver, in the hope that this would eliminate the need for shuffles and thus prevent this error. This did not work; the error occurs in exactly the same way.
  • I have isolated the problem to an operation that processes a dataset sequentially through a tail-recursive method (a sketch of roughly what that looked like follows this list).
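For context, the operation was shaped roughly like the following. This is a hypothetical reconstruction, not the actual application code; DataPoint, Result, and SequentialProcessor are made-up names, but the structure matches: a public method that walks a list of data points, carries an accumulator, and calls itself in tail position.

    // Hypothetical reconstruction of the problematic shape, not the actual application code.
    case class DataPoint(value: Double)
    case class Result(value: Double)

    class SequentialProcessor {
      // The recursive call is in tail position, but because this method is public and
      // can be overridden, the Scala compiler does not rewrite it into a loop: every
      // data point still costs one stack frame, so a list of ~10,000 points can
      // exhaust the default JVM thread stack.
      def processAll(points: List[DataPoint], acc: List[Result] = Nil): List[Result] =
        points match {
          case Nil          => acc.reverse
          case head :: tail => processAll(tail, Result(head.value * 2) :: acc)
        }
    }

Annotating a method like this with scala.annotation.tailrec makes the compiler report that it cannot be optimized (it is neither private nor final), which would have pointed at the problem much earlier.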

What is causing this problem? How can I go about determining the cause myself?

The problem turns out to be a stack overflow (ha!) occurring on the worker.

On a hunch, I rewrote the operation to be performed entirely on the driver (effectively disabling Spark functionality). When I ran this code, the system still crashed, but now displayed a StackOverflowError. Contrary to what I previously believed, a tail-recursive method can still cause a stack overflow just like any other form of recursion: the Scala compiler only eliminates the tail call when the method cannot be overridden (for example when it is final, private, or a local def). After rewriting the method to no longer use recursion, the problem disappeared.
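As a sketch of the kind of rewrite that fixed it for me (reusing the hypothetical DataPoint and Result types from the sketch above; the actual change was to my own code), the self-call is replaced by a fold, so the stack depth stays constant regardless of how many data points there are:

    object SequentialProcessing {
      // Iterative version of the hypothetical processAll above: a left fold visits
      // the points in order without growing the call stack.
      def processAll(points: List[DataPoint]): List[Result] =
        points.foldLeft(List.empty[Result]) { (acc, p) =>
          Result(p.value * 2) :: acc
        }.reverse
    }

An alternative would have been to make the recursive method final or private and annotate it with scala.annotation.tailrec, so that the compiler either turns it into a loop or refuses to compile.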

A stack overflow is probably not the only problem that can produce the original FileNotFoundException, but making a temporary code change that pulls the operation onto the driver seems to be a good way to determine the actual cause of the problem.
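For concreteness, this is roughly what that temporary change looks like (hypothetical names again, reusing the sketches above, and assuming a SparkSession-based Spark 2.x application): collect the data to the driver and run the same sequential operation there, so the real exception surfaces on the driver instead of being reported as a FileNotFoundException on a temporary shuffle file from a dying executor.

    import org.apache.spark.sql.SparkSession

    object DriverSideDebug {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("driver-side-debug").master("local[*]").getOrCreate()
        val points = spark.sparkContext.parallelize(1 to 10000).map(i => DataPoint(i.toDouble))

        // Temporary debugging change: pull the data to the driver and run the suspect
        // operation there. If the operation itself is broken (here: deep recursion),
        // the real StackOverflowError is thrown directly on the driver.
        val local = points.collect().toList
        val results = new SequentialProcessor().processAll(local)
        println(s"processed ${results.size} points")

        spark.stop()
      }
    }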
