
ExecutorService never stops when executing a new task inside another executing task

Good day.

I have a blocker issue with my web crawler project. The logic is simple. First it creates one Runnable, which downloads an HTML document, scans all links, and then creates a new Runnable object for each found link. Each newly created Runnable in turn creates new Runnable objects for each link and executes them.

The problem is that the ExecutorService never stops.

CrawlerTest.java

public class CrawlerTest {

    public static void main(String[] args) throws InterruptedException {
        new CrawlerService().crawlInternetResource("https://jsoup.org/");
    }
}

CrawlerService.java

import java.io.IOException;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CrawlerService {

    private Set<String> uniqueUrls = Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>(10000));
    private ExecutorService executorService = Executors.newFixedThreadPool(8);
    private String baseDomainUrl;

    public void crawlInternetResource(String baseDomainUrl) throws InterruptedException {
        this.baseDomainUrl = baseDomainUrl;
        System.out.println("Start");
        executorService.execute(new Crawler(baseDomainUrl)); // Run the first task and scan the main domain page. This task produces new tasks.
        executorService.awaitTermination(10, TimeUnit.MINUTES);
        System.out.println("End");
    }

    private class Crawler implements Runnable { // Inner class that encapsulates a crawling task and scans for links

        private String urlToCrawl;

        public Crawler(String urlToCrawl) {
            this.urlToCrawl = urlToCrawl;
        }

        public void run() {
            try {
                findAllLinks();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }

        private void findAllLinks() throws InterruptedException {
            /* Try to add the new url to the collection; if the url is unique, add it,
             * scan the document, and start a new task for each found link */
            if (uniqueUrls.add(urlToCrawl)) { 
                System.out.println(urlToCrawl);

                Document htmlDocument = loadHtmlDocument(urlToCrawl);
                if (htmlDocument == null) {
                    return; // page failed to load, skip this URL
                }
                Elements foundLinks = htmlDocument.select("a[href]");

                for (Element link : foundLinks) {
                    String absLink = link.attr("abs:href");
                    if (absLink.contains(baseDomainUrl) && !absLink.contains("#")) { // Check that we don't leave the base domain
                        executorService.execute(new Crawler(absLink)); // Start a new task for each found link
                    }
                }
            }
        }

        private Document loadHtmlDocument(String internetResourceUrl) {
            Document document = null;
            try {
                document = Jsoup.connect(internetResourceUrl).ignoreHttpErrors(true).ignoreContentType(true)
                        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0")
                        .timeout(10000).get();
            } catch (IOException e) {
                System.out.println("Page load error");
                e.printStackTrace();
            }
            return document;
        }
    }
}

This app needs about 20 seconds to scan jsoup.org for all unique links. But it just waits 10 minutes at executorService.awaitTermination(10, TimeUnit.MINUTES); and then I see a dead main thread and a still-working executor.


How do I force the ExecutorService to work correctly?

I think the problem is that it invokes executorService.execute inside another task instead of in the main thread.

You are misusing awaitTermination. According to the javadoc you should call shutdown first:

Blocks until all tasks have completed execution after a shutdown request, or the timeout occurs, or the current thread is interrupted, whichever happens first.

To achieve your goal I'd suggest using a CountDownLatch (or a latch that supports increments, like this one) to determine the exact moment when there are no tasks left, so you can safely call shutdown.

I see your comment from earlier:

I can't use CountDownLatch because I don't know beforehand how many unique links I will collect from the resource.

First off, vsminkov is spot on with the answer as to why awaitTermination will sit and wait for 10 minutes. I will offer an alternate solution.

Instead of using a CountDownLatch, use a Phaser. For each new task, you can register and await completion.

Create a single Phaser, register each time a task is submitted to the executor, and arrive each time a Runnable completes.

public void crawlInternetResource(String baseDomainUrl) {
    this.baseDomainUrl = baseDomainUrl;

    Phaser phaser = new Phaser();
    executorService.execute(new Crawler(phaser, baseDomainUrl)); 
    int phase = phaser.getPhase();
    phaser.awaitAdvance(phase); // blocks until every registered task has arrived
}

private class Crawler implements Runnable { 

    private final Phaser phaser;
    private String urlToCrawl;

    public Crawler(Phaser phaser, String urlToCrawl) {
        this.urlToCrawl = urlToCrawl;
        this.phaser = phaser;
        phaser.register(); // register new task
    }

    public void run() {
        try {
           ...
        } finally {
            phaser.arrive(); // in a finally block so arrival happens even on failure
        }
    }
}

You are not calling shutdown.

This may work: an AtomicLong variable in the CrawlerService. Increment it before every new sub-task is submitted to the executor service.

Modify your run() method to decrement this counter and, if it reaches 0, shut down the executor service:

public void run() {
    try {
        findAllLinks();
    } catch (InterruptedException e) {
        e.printStackTrace();
    } finally {
        // Decrement the counter.
        // If it reaches 0, shut down the executor here, or just notify the waiting CrawlerService.
    }
}

In the "finally" block, decrement the counter, and when the counter reaches zero, shut down the executor or just notify CrawlerService. Zero means this is the last task: no other task is running and none are pending in the queue, so no task will submit any new sub-tasks.
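A minimal self-contained sketch of this counter approach (names like `pending` and `submit` are illustrative, not from the original code). The increment happens before submission and the decrement in a finally block, so the count can only reach zero after the last task, which then shuts the pool down:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class CounterShutdownDemo {

    static final ExecutorService pool = Executors.newFixedThreadPool(4);
    static final AtomicLong pending = new AtomicLong();

    // Increment BEFORE submitting so the count never hits zero early.
    static void submit(int depth) {
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                if (depth > 0) {        // each task spawns two sub-tasks
                    submit(depth - 1);
                    submit(depth - 1);
                }
            } finally {
                if (pending.decrementAndGet() == 0) {
                    pool.shutdown();    // last task: nobody else can submit
                }
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        submit(3);
        pool.awaitTermination(1, TimeUnit.MINUTES); // returns soon after shutdown
        System.out.println("terminated=" + pool.isTerminated()); // prints terminated=true
    }
}
```

Note that children increment the counter inside the parent task, before the parent's own decrement, which is what prevents a premature zero.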

How do I force the ExecutorService to work correctly?

I think the problem is that it invokes executorService.execute inside another task instead of in the main thread.

No. The problem is not with the ExecutorService. You are using the APIs in an incorrect manner and hence not getting the right result.

You have to use three APIs in a certain order to get the right result:

1. shutdown
2. awaitTermination
3. shutdownNow

Recommended way from the Oracle documentation page of ExecutorService:

 void shutdownAndAwaitTermination(ExecutorService pool) {
   pool.shutdown(); // Disable new tasks from being submitted
   try {
     // Wait a while for existing tasks to terminate
     if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
       pool.shutdownNow(); // Cancel currently executing tasks
       // Wait a while for tasks to respond to being cancelled
       if (!pool.awaitTermination(60, TimeUnit.SECONDS))
           System.err.println("Pool did not terminate");
     }
   } catch (InterruptedException ie) {
     // (Re-)Cancel if current thread also interrupted
     pool.shutdownNow();
     // Preserve interrupt status
     Thread.currentThread().interrupt();
   }
 }

shutdown(): Initiates an orderly shutdown in which previously submitted tasks are executed, but no new tasks will be accepted.

shutdownNow(): Attempts to stop all actively executing tasks, halts the processing of waiting tasks, and returns a list of the tasks that were awaiting execution.

awaitTermination(): Blocks until all tasks have completed execution after a shutdown request, or the timeout occurs, or the current thread is interrupted, whichever happens first.

On a different note: if you want to wait for all tasks to complete, refer to this related SE question:

Wait until all threads finish their work in Java

I prefer using invokeAll() or ForkJoinPool(), which are best suited for your use case.
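As a rough illustration of the ForkJoinPool route, here is a sketch where each URL becomes a RecursiveAction that forks a child task per discovered link. The `extractLinks` method is a stand-in for the Jsoup call (it fabricates a small link tree so the example is runnable offline), and pool.invoke blocks until the whole task tree completes, so no shutdown bookkeeping is needed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ForkJoinCrawlSketch {

    static final Set<String> visited = ConcurrentHashMap.newKeySet();

    // Stand-in for Jsoup link extraction: returns the child "links" of a URL.
    static List<String> extractLinks(String url) {
        List<String> links = new ArrayList<>();
        if (url.length() < 4) {                 // limit the depth of the fake site
            links.add(url + "a");
            links.add(url + "b");
        }
        return links;
    }

    static class CrawlTask extends RecursiveAction {
        final String url;
        CrawlTask(String url) { this.url = url; }

        @Override
        protected void compute() {
            if (!visited.add(url)) return;      // skip already-seen URLs
            List<CrawlTask> subtasks = new ArrayList<>();
            for (String link : extractLinks(url)) {
                subtasks.add(new CrawlTask(link));
            }
            invokeAll(subtasks);                // fork children and wait for them
        }
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool();
        pool.invoke(new CrawlTask("/"));        // blocks until the whole tree is done
        System.out.println("visited=" + visited.size()); // prints visited=15
    }
}
```

Because invoke returns only when the root task and all of its descendants are done, the "executor never stops" problem disappears by construction.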
