
Submit new task to executor after worker finishes

I am working on a web crawler that visits a page and extracts its links, looking for a specific domain. If it does not find the domain, it visits the extracted links and repeats until it hits a page limit or finds the page. I am struggling to come up with sound logic that lets the bot keep queuing tasks after it extracts links, because the tasks complete quickly and there is not enough time to submit the newly extracted links. How could I make the crawler wait until it has no more links before shutting down the executor? I have included a basic overview of my multithreading implementation below. I set the max threads to 3 and submit example.com 10 times (the seed domains).

SpawnThread visits the site, extracts the links, and returns them as a string. My issue is that I need to take those results and put them back into the queue, but the queue has already finished by that time. Any suggestions?

Update: To clarify, my issue is that when I submit a seed and get the results back, I cannot get the crawler to continue searching the returned links unless I block, wait for the results, and add them in manually.

Update 2: To clarify a bit more, I am trying to prevent blocking on future.get() so I can schedule the returned results as new tasks as they come in.
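
The direction I have been exploring is ExecutorCompletionService, since its take() hands back whichever Future finishes first instead of blocking on one particular Future the way future.get() does. A rough sketch of what I mean (fetchTask() here is a hypothetical stand-in for my SpawnThread, not real code):

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.*;

    public class CompletionOrderSketch {
        // Hypothetical stand-in for my SpawnThread: would fetch a page and return its links.
        static Callable<List<String>> fetchTask(String url) {
            return () -> Arrays.asList(url + "/a", url + "/b"); // placeholder links
        }

        public static void main(String[] args) throws Exception {
            ExecutorService executor = Executors.newFixedThreadPool(3);
            CompletionService<List<String>> completionService =
                    new ExecutorCompletionService<>(executor);

            for (int i = 0; i < 10; i++) {
                completionService.submit(fetchTask("https://example.com/" + i));
            }

            // take() returns futures in completion order, so each result can be
            // processed (and new links submitted) the moment it is ready.
            for (int i = 0; i < 10; i++) {
                List<String> links = completionService.take().get();
                System.out.println(links);
            }
            executor.shutdown();
        }
    }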

    int maxThreads = 3;
    ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(maxThreads);
    List<Future<String>> resultList = new ArrayList<>(); // holds each task's Future

    for (int i = 0; i < 10; i++) {
        SpawnThread task = new SpawnThread("example.com"); // create the task
        Future<String> result = executor.submit(task);     // launch it
        //System.out.println("Added " + i + " to the queue!");
        resultList.add(result);                            // store the task's Future
    }

    for (Future<String> future : resultList) { // loop through the results
        try {
            String resultFinished = future.get(); // blocks until this task finishes
            System.out.println(resultFinished);
        } catch (InterruptedException | ExecutionException e) {
            e.printStackTrace();
        }
    }
    executor.shutdown();

I think what I need is a non-blocking queue for the results, whose contents can be fed back into the list that supplies new domains to crawl, but I cannot seem to get it to work. Here is my attempt:

    BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
    Executor executor = Executors.newFixedThreadPool(4);
    CompletionService<List<String>> completionService =
            new ExecutorCompletionService<>(executor);
    List<String> pagesToVisit = new ArrayList<>();
    Set<String> pagesVisited = new HashSet<>();

    String seedPage = "https://example.com/";
    String currentURL = null;

    boolean done = false;
    while (!done) {

        int listSize = pagesToVisit.size();
        if (pagesToVisit.isEmpty()) {
            currentURL = seedPage;
            pagesVisited.add(seedPage);
            listSize = pagesToVisit.size() + 1;
        } else {
            currentURL = nextUrl();
        }

        for (int k = 0; k < listSize; k++) {
            completionService.submit(new Spider(currentURL, "IP", "PORT"));
        }

        int received = 0;
        boolean errors = false;
        while (received < listSize && !errors) {
            try {
                // take() already blocks until some task finishes, so no sleep is needed
                Future<List<String>> resultFuture = completionService.take();
                List<String> result = resultFuture.get();
                pagesToVisit.addAll(result);
                received++;
            } catch (InterruptedException | ExecutionException e) {
                e.printStackTrace(); // log
                errors = true;
            }
        }
    }
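
What I think my loop above is missing is a single count of in-flight tasks: consume results with take(), submit any newly found links immediately, and only shut the executor down once that count reaches zero. A rough sketch of the pattern I am aiming for, with a hypothetical fetchTask() again standing in for my Spider:

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.*;

    public class DrainSketch {
        // Hypothetical stand-in for my Spider: would fetch the URL and return extracted links.
        static Callable<List<String>> fetchTask(String url) {
            return () -> Collections.emptyList(); // placeholder: no links found
        }

        public static void main(String[] args) throws Exception {
            ExecutorService executor = Executors.newFixedThreadPool(4);
            CompletionService<List<String>> completionService =
                    new ExecutorCompletionService<>(executor);

            Set<String> visited = new HashSet<>();
            int pending = 0;      // tasks submitted but not yet consumed
            int pageLimit = 1000;

            String seed = "https://example.com/";
            visited.add(seed);
            completionService.submit(fetchTask(seed));
            pending++;

            while (pending > 0) {
                List<String> links = completionService.take().get(); // completion order
                pending--;
                for (String link : links) {
                    if (visited.size() >= pageLimit) break;
                    if (visited.add(link)) { // skip URLs already queued or visited
                        completionService.submit(fetchTask(link));
                        pending++;
                    }
                }
            }
            // pending == 0: every task finished and produced no new links to crawl
            executor.shutdown();
        }
    }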

I'm not sure I got your question right, but

you can use the awaitTermination() method:

public boolean awaitTermination(long timeout, TimeUnit unit) throws InterruptedException

Blocks until all tasks have completed execution after a shutdown request, or the timeout occurs, or the current thread is interrupted, whichever happens first.

Parameters:
timeout - the maximum time to wait
unit - the time unit of the timeout argument

Returns: true if this executor terminated and false if the timeout elapsed before termination

Throws: InterruptedException - if interrupted while waiting

For example:

    try {
        executor.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
        // handle or restore the interrupt
    }

Note that the shutdown() method does not wait for tasks to complete:

Initiates an orderly shutdown in which previously submitted tasks are executed, but no new tasks will be accepted. Invocation has no additional effect if already shut down. This method does not wait for previously submitted tasks to complete execution.
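
Putting the two together, a common pattern (sketched with your executor variable; the 60-second timeout is just an example) is:

    executor.shutdown(); // stop accepting new tasks; already-queued tasks keep running
    try {
        // block up to 60 seconds for the remaining tasks to finish
        if (!executor.awaitTermination(60, TimeUnit.SECONDS)) {
            executor.shutdownNow(); // timed out: interrupt the worker threads
        }
    } catch (InterruptedException e) {
        executor.shutdownNow();
        Thread.currentThread().interrupt(); // preserve the interrupt status
    }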
