
Active tasks is a negative number in Spark UI

When using Spark, I saw this:

[Screenshot: Spark UI showing a negative number of active tasks]

where you see that the active tasks are a negative number (the difference of the total tasks from the completed tasks).

What is the source of this error?


Note that I have many executors. However, there is a task that seems to be idle (I don't see any progress), while another identical task completed normally.


Also related: that mail. I can confirm that many tasks are being created, since I am using 1k or 2k executors.

The error I am getting is a bit different:

16/08/15 20:03:38 ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
16/08/15 20:07:18 WARN TaskSetManager: Lost task 20652.0 in stage 4.0 (TID 116652, myfoo.com): FetchFailed(BlockManagerId(61, mybar.com, 7337), shuffleId=0, mapId=328, reduceId=20652, message=
org.apache.spark.shuffle.FetchFailedException: java.util.concurrent.TimeoutException: Timeout waiting for task.
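
The first log line above appears to be the key symptom: when the LiveListenerBus drops events, the UI's task counters can stop matching reality. One mitigation I found is to enlarge the listener bus event queue. A minimal PySpark sketch, assuming a Spark version where the knob is spark.scheduler.listenerbus.eventqueue.size (later releases renamed it to spark.scheduler.listenerbus.eventqueue.capacity); the value chosen is illustrative, not tuned:

    from pyspark import SparkConf, SparkContext

    # Sketch: enlarge the listener bus event queue so SparkListenerEvents
    # are less likely to be dropped when the scheduler starts tasks faster
    # than the listeners can consume them.
    conf = (SparkConf()
            .setAppName("larger-event-queue")
            # Default is 10000; 100000 is an illustrative guess, not a tuned value.
            .set("spark.scheduler.listenerbus.eventqueue.size", "100000"))
    sc = SparkContext(conf=conf)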

It is a Spark issue. It occurs when executors restart after failures. A JIRA issue has already been created for it; you can get more details from https://issues.apache.org/jira/browse/SPARK-10141.

As answered by S. Owen on the Spark-dev mailing list, there are several JIRA tickets relevant to this issue, such as:

  1. ResourceManager UI showing negative value
  2. NodeManager reports negative running containers

This behavior usually occurs when (many) executors restart after failure(s).


This behavior can also occur when the application uses too many executors. Use coalesce() to fix this case.

To be exact, in Prepare my bigdata with Spark via Python, I had >400k partitions. I used data.coalesce(1024), as described in Repartition an RDD, and I was able to bypass that Spark UI bug. You see, partitioning is a very important concept when it comes to distributed computing and Spark.
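
A minimal PySpark sketch of that fix, assuming the data comes from a hypothetical HDFS path; only the data.coalesce(1024) call is from my actual run:

    from pyspark import SparkContext

    sc = SparkContext(appName="coalesce-example")
    data = sc.textFile("hdfs:///path/to/input")  # hypothetical input path
    print(data.getNumPartitions())  # in my case this was >400k
    # coalesce() merges existing partitions without a full shuffle,
    # unlike repartition(), which always shuffles.
    data = data.coalesce(1024)
    print(data.getNumPartitions())  # now 1024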

In my question I also use 1-2k executors, so it must be related.

Note: with too few partitions you might experience this Spark Java error: Size exceeds Integer.MAX_VALUE.
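
That error comes from Spark's 2 GB limit on a single partition's block. A back-of-the-envelope sketch for picking a partition count; the dataset size and the 128 MB target below are assumptions for illustration only:

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-sizing")
    total_size_gb = 500        # assumed input size, illustrative only
    target_partition_mb = 128  # rule-of-thumb target per partition
    num_partitions = int(total_size_gb * 1024 / target_partition_mb)  # 4000
    rdd = sc.textFile("hdfs:///path/to/input")  # hypothetical path
    # repartition() performs a full shuffle but evens out partition sizes,
    # keeping each block well under the 2 GB (Integer.MAX_VALUE bytes) limit.
    rdd = rdd.repartition(num_partitions)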
