
Active tasks is a negative number in Spark UI

When using Spark, I saw this:

[screenshot: Spark UI showing a negative number of active tasks]

where you can see that the active tasks are a negative number (the total tasks minus the completed tasks).

What is the source of this error?


Note that I have many executors. However, there seems to be a task that has been idle (I don't see any progress), while another identical task completed normally.


This is also related: that mail . I can confirm that many tasks are being created, since I am using 1k or 2k executors.

The error I am getting is a bit different:

16/08/15 20:03:38 ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
16/08/15 20:07:18 WARN TaskSetManager: Lost task 20652.0 in stage 4.0 (TID 116652, myfoo.com): FetchFailed(BlockManagerId(61, mybar.com, 7337), shuffleId=0, mapId=328, reduceId=20652, message=
org.apache.spark.shuffle.FetchFailedException: java.util.concurrent.TimeoutException: Timeout waiting for task.
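
For context on the first message: LiveListenerBus buffers UI and metrics events in a bounded in-memory queue, and when tens of thousands of tasks start in a short time the queue can fill up and events get dropped, which may be one reason the UI counters end up inconsistent. As a hedged sketch only: in some Spark 2.x versions the queue size is exposed as a configuration property; the property name and value below are illustrative and should be checked against your Spark version's documentation.

# Hedged sketch: spark.scheduler.listenerbus.eventqueue.size (default 10000)
# exists in some Spark 2.x releases; the value below is only an illustration.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("many-tasks-job")                                   # illustrative app name
        .set("spark.scheduler.listenerbus.eventqueue.size", "100000"))  # enlarge the listener event queue
sc = SparkContext(conf=conf)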

It is a Spark issue that occurs when executors restart after failures. A JIRA ticket has already been filed for it; see https://issues.apache.org/jira/browse/SPARK-10141 for more details.

As answered on the Spark-dev mailing list by S. Owen, there are several JIRA tickets relevant to this issue, such as:

  1. ResourceManager UI showing negative value
  2. NodeManager reports negative running containers

This behavior usually occurs when (many) executors restart after failure(s).


This behavior can also occur when the application spawns too many tasks, e.g. because an RDD has far too many partitions. Use coalesce() to fix this case (see the sketch below).

To be exact, in Prepare my bigdata with Spark via Python , I had >400k partitions. I used data.coalesce(1024) , as described in Repartition an RDD, and I was able to bypass that Spark UI bug. You see, partitioning is a very important concept when it comes to distributed computing and Spark.
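
For reference, a minimal PySpark sketch of that coalesce step; the input path and application name are placeholders, not the exact values from my job.

# Minimal PySpark sketch; the HDFS path below is a placeholder.
from pyspark import SparkContext

sc = SparkContext(appName="coalesce-example")

data = sc.textFile("hdfs:///path/to/big/dataset")  # can easily yield >400k partitions
print(data.getNumPartitions())                     # inspect the current partition count

# coalesce() merges existing partitions without a full shuffle,
# unlike repartition(), so it is cheap for reducing the count.
data = data.coalesce(1024)
print(data.getNumPartitions())                     # now 1024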

In my question I also use 1-2k executors, so it must be related.

Note: with too few partitions, you might experience this Spark Java Error: Size exceeds Integer.MAX_VALUE.
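
For intuition only, here is a back-of-the-envelope way to pick a partition count between the two extremes (so many partitions that the scheduler and UI are flooded with tasks, versus partitions so large they hit the 2 GB / Integer.MAX_VALUE limit); every size in it is an assumption, not taken from my data.

# Back-of-the-envelope sketch in Python; all sizes here are assumptions.
# Spark can fail with "Size exceeds Integer.MAX_VALUE" when a single block
# approaches 2 GB, so keep each partition comfortably below that.
total_size_bytes = 2 * 1024**4           # assume ~2 TiB of input data
target_partition_bytes = 256 * 1024**2   # aim for roughly 256 MiB per partition

num_partitions = max(1, total_size_bytes // target_partition_bytes)
print(num_partitions)  # 8192 partitions of ~256 MiB each: far fewer than 400k, well under 2 GB apiece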
