
Spark batches do not complete when running on a Yarn cluster

Setting the scene
I am working to make a Spark streaming application (Spark 2.2.1 with Scala) run on a Yarn cluster (Hadoop 2.7.4).

So far I managed to submit the application to the Yarn cluster with spark-submit. I can see that the receiver task starts up correctly and fetches a lot of records from the database (Couchbase Server 5.0), and I can also see that the records are divided into batches.

The question
When I look at the Streaming Statistics on the Spark Web UI, however, I can see that my batches are never processed. I have seen batches with 0 records process and complete, but when a batch with records starts processing, it never completes. One time it even got stuck on a batch with 0 records.

I even tried simplifying the output operations on the StreamingContext as much as possible. But even with the very simple output operation print(), my batches are never processed. The logs do not show any warnings or errors.

Does anyone know what might be wrong? Any suggestions on how to solve this will be much appreciated.

More Info
The main class of the Spark application is built from this example (the first one) from the Couchbase Spark Connector documentation, combined with this example with checkpointing from the Spark documentation.
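
For reference, a rough sketch of how those two examples can fit together might look like the following. This is only an illustration: the checkpoint directory, application name and bucket name are placeholders, and the couchbaseStream call is assumed to behave as in the connector example linked above.

    import com.couchbase.spark.streaming._ // assumed import, per the Couchbase Spark Connector example
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingMain {

      // Placeholder checkpoint location, for illustration only
      val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

      def createContext(): StreamingContext = {
        val conf = new SparkConf()
          .setAppName("CouchbaseStreamingExample")       // placeholder name
          .set("com.couchbase.bucket.travel-sample", "") // assumed bucket, as in the connector docs
        val ssc = new StreamingContext(conf, Seconds(5)) // 5 second batch interval, as in the question
        ssc.checkpoint(checkpointDir)

        // Receiver-based DStream from the connector (assumed API), with print()
        // as the only output operation, matching the simplified setup described above
        ssc.couchbaseStream(from = FromBeginning, to = ToInfinity).print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // Restore from the checkpoint if present, otherwise create a new context,
        // following the checkpointing example in the Spark documentation
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }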

Right now I have 3230 Active Batches (3229 queued and 1 processing) and 1 Completed Batch (which had 0 records), and the application has been running for 4 hours and 30 minutes... and another batch is added every 5 seconds.

If I look at the "thread dump" for the executors I see a lot of WAITING, TIMED_WAITING and a few RUNNABLE threads. The list would fill up 3 screenshots, so I will only post it if needed.

Below you will find some screenshots from the Web UI:

Executor Overview

Spark Jobs Overview

Node Overview with resources

Capacity Scheduler Overview

Per the screenshots, you have 2 cores: 1 is being used for the driver and the other is being used for the receiver. You don't have a core left for the actual processing to happen. Please increase the number of cores and try again.
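
In other words, the receiver occupies the only available task slot, so the queued batches can never be scheduled. On YARN you can request more executors and/or more cores per executor, either with the spark-submit flags --num-executors and --executor-cores or on the SparkConf. A sketch with purely illustrative numbers:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("CouchbaseStreamingExample") // placeholder name
      .set("spark.executor.instances", "2")    // executors requested from YARN
      .set("spark.executor.cores", "2")        // cores per executor, so the receiver still leaves cores for batch processing

With this layout the receiver takes one task slot and the remaining cores are free to process the queued batches.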

Refer: https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers

If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use "local[n]" as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).
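
The same rule can be illustrated for local testing: the master URL needs more threads than there are receivers, e.g. local[2] for a single receiver plus one processing thread (sketch only, the app name is a placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setMaster("local[2]")                   // 2 threads: 1 for the receiver, 1 for processing
      .setAppName("CouchbaseStreamingExample") // placeholder name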
