
Submit new task to executor after worker finishes

I am working on a web crawler that visits a page and extracts its links, looking for a specific domain. If it does not find the domain, it visits the extracted links and repeats until it either finds the page or hits a page limit. I am struggling to come up with sound logic for queuing new tasks after the links are extracted: the tasks complete quickly, and the executor is finished before the newly extracted links can be submitted. How can I make the crawler wait until there are no more links before shutting down the executor? Below is a basic overview of my multithreading implementation. I set the maximum number of threads to 3 and submit example.com 10 times (the seed domains).

SpawnThread visits the site, extracts the links, and returns them as a string. My issue is that I need to take those results and put them back into the queue, but by that time the queue has already finished. Any suggestions?

Update: To clarify, my issue is that when I submit a seed and get the results, I cannot get the crawler to continue searching the returned seeds unless I block, wait for the results, and then add them manually.

Update 2: To clarify further, I am trying to avoid blocking on future.get() so that I can schedule the returned results as new tasks as they arrive.

    int maxThreads = 3;
    ExecutorService executor = Executors.newFixedThreadPool(maxThreads); // number of worker threads
    List<Future<String>> resultList = new ArrayList<>(); // futures of the submitted tasks

    for (int i = 0; i < 10; i++) {
        SpawnThread task = new SpawnThread("example.com"); // create the task
        Future<String> result = executor.submit(task);     // launch the task
        resultList.add(result);                            // store its future
    }

    for (Future<String> future : resultList) { // loop through the results
        try {
            String resultFinished = future.get(); // blocks until this task has finished
            System.out.println(resultFinished);
        } catch (InterruptedException | ExecutionException e) {
            e.printStackTrace();
        }
    }
    executor.shutdown();

I think what I need is a non-blocking queue for the results, so that they can be added back to the list that supplies new domains to crawl, but I cannot get it to work.

    BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
    Executor executor = Executors.newFixedThreadPool(4);
    CompletionService<List<String>> completionService =
            new ExecutorCompletionService<>(executor);
    List<String> pagesToVisit = new ArrayList<>();
    Set<String> pagesVisited = new HashSet<>();

    String seedPage = "https://example.com/";
    String currentURL = null;

    boolean done = false;
    while (!done) {
        int listSize = pagesToVisit.size();
        if (pagesToVisit.isEmpty()) {
            currentURL = seedPage;
            pagesVisited.add(seedPage);
            listSize = pagesToVisit.size() + 1;
        } else {
            currentURL = nextUrl();
        }

        for (int k = 0; k < listSize; k++) {
            completionService.submit(new Spider(currentURL, "IP", "PORT"));
        }

        int received = 0;
        boolean errors = false;
        while (received < listSize && !errors) {
            Thread.sleep(1000);
            Future<List<String>> resultFuture = completionService.take(); // blocks if none available
            try {
                List<String> result = resultFuture.get();
                pagesToVisit.addAll(result);
                received++;
            } catch (Exception e) {
                e.printStackTrace(); // log
                errors = true;
            }
        }
    }

I'm not sure I understood your question correctly, but you can use the awaitTermination() method.

public boolean awaitTermination(long timeout, TimeUnit unit) throws InterruptedException

Blocks until all tasks have completed execution after a shutdown request, or the timeout occurs, or the current thread is interrupted, whichever happens first.

Parameters: timeout - the maximum time to wait; unit - the time unit of the timeout argument

Returns: true if this executor terminated and false if the timeout elapsed before termination

Throws: InterruptedException - if interrupted while waiting

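For example:

    executor.shutdown(); // awaitTermination() only reports termination after a shutdown request
    try {
        executor.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore the interrupt flag
    }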
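The return value of awaitTermination() is worth checking, because false means the timeout elapsed while tasks were still running. The following is a minimal sketch of that pattern, not a definitive implementation; the class name GracefulShutdownDemo and the helpers shutdownAndWait and runTasks are hypothetical names chosen for illustration, and the trivial counter tasks stand in for real crawl work.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class GracefulShutdownDemo {

    /**
     * Stops accepting new tasks, then waits up to the timeout for submitted ones.
     * Returns true if the pool terminated in time; otherwise cancels what is left.
     */
    static boolean shutdownAndWait(ExecutorService executor, long timeout, TimeUnit unit) {
        executor.shutdown(); // no new tasks accepted; already submitted tasks still run
        try {
            if (executor.awaitTermination(timeout, unit)) {
                return true;
            }
            executor.shutdownNow(); // timed out: interrupt running tasks, drop queued ones
            return false;
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /** Runs taskCount trivial tasks on a 3-thread pool; returns how many completed, or -1 on timeout. */
    static int runTasks(int taskCount) {
        ExecutorService executor = Executors.newFixedThreadPool(3);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            executor.execute(() -> { completed.incrementAndGet(); });
        }
        boolean terminated = shutdownAndWait(executor, 5, TimeUnit.SECONDS);
        return terminated ? completed.get() : -1;
    }

    public static void main(String[] args) {
        System.out.println("Completed tasks: " + runTasks(10));
    }
}
```

This only helps once every task has already been submitted. It does not, on its own, solve the case where finished tasks produce new tasks.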

Note that the shutdown() method does not wait for submitted tasks to complete:

Initiates an orderly shutdown in which previously submitted tasks are executed, but no new tasks will be accepted. Invocation has no additional effect if already shut down. This method does not wait for previously submitted tasks to complete execution.
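Because an executor accepts no new tasks after shutdown(), a crawler whose tasks produce further tasks has to delay the shutdown until nothing is in flight. One possible approach, sketched below under stated assumptions, is to count in-flight tasks around a CompletionService: each completed result may trigger new submissions, and the loop ends when the counter reaches zero, the target is found, or the page limit is hit. The class CrawlLoopSketch, the FAKE_WEB map, and extractLinks are hypothetical stand-ins for the question's Spider and real HTTP fetching.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CrawlLoopSketch {

    // Hypothetical in-memory "web": page -> links on that page (replaces real fetching).
    static final Map<String, List<String>> FAKE_WEB = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c", "d"),
            "c", List.of("a"),
            "d", List.of("target.example"));

    /** Stand-in for visiting a page and extracting its links. */
    static List<String> extractLinks(String url) {
        return FAKE_WEB.getOrDefault(url, List.of());
    }

    /** Crawls from seed; stops when target is seen, pageLimit is reached, or no work remains. */
    static Set<String> crawl(String seed, String target, int pageLimit, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        CompletionService<List<String>> completion = new ExecutorCompletionService<>(executor);
        Set<String> visited = new LinkedHashSet<>();
        int inFlight = 0; // tasks submitted but not yet collected
        try {
            visited.add(seed);
            completion.submit(() -> extractLinks(seed));
            inFlight++;
            boolean found = seed.equals(target);
            while (inFlight > 0 && !found) {
                Future<List<String>> done = completion.take(); // next task to finish, in any order
                inFlight--;
                List<String> links;
                try {
                    links = done.get(); // already complete, so this returns immediately
                } catch (ExecutionException e) {
                    continue; // one failed page should not stop the crawl
                }
                for (String link : links) {
                    if (link.equals(target)) {
                        visited.add(link);
                        found = true;
                        break;
                    }
                    // Set.add returns false for pages that were already submitted.
                    if (visited.size() < pageLimit && visited.add(link)) {
                        completion.submit(() -> extractLinks(link));
                        inFlight++;
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop crawling but keep the interrupt flag
        } finally {
            executor.shutdownNow(); // nothing more to submit; cancels stragglers after an early hit
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("a", "target.example", 10, 3));
    }
}
```

Only the coordinating thread touches the counter and the visited set, so neither needs synchronization. The take() call still blocks, but only until the next task of any kind finishes, so results are rescheduled as they arrive rather than in submission order.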
