Concurrency for recursive webcrawler-algorithm in Java

Question

I wrote a Java program that finds all pages of a website, starting from the URL of the start page (using Jsoup as the web crawler). It works for small websites but is too slow for sites with 200 or more pages:

public class SiteInspector {

private ObservableSet<String> allUrlsOfDomain; // all URLS found for site
private Set<String> toVisit; // pages that were found but not visited yet
private Set<String> visited; // URLS that were visited
private List<String> invalid; // broken URLs

public SiteInspector() {...}

public void getAllWebPagesOfSite(String entry) //entry must be startpage of a site
{
    toVisit.add(entry);
    allUrlsOfDomain.add(entry);
    while(!toVisit.isEmpty())
    {
        String next = popElement(toVisit);
        getAllLinksOfPage(next);  //expensive
        toVisit.remove(next);
    }
}


public void getAllLinksOfPage(String pageURL) {
    try {

        if (urlIsValid(pageURL)) {
            visited.add(pageURL);
            Document document = Jsoup.connect(pageURL).get();  //connect to pageURL (expensive network operation)
            Elements links = document.select("a");             //get all links from page 
            for(Element link : links)
            {
                String nextUrl = link.attr("abs:href");            // "http://..."
                if(nextUrl.contains(new URL(pageURL).getHost()))  //ignore URLs to external hosts
                {
                    if(!isForbiddenForCrawlers(nextUrl))           // URLS forbidden by robots.txt
                    {
                        if(!visited.contains(nextUrl))
                        {
                            toVisit.add(nextUrl);
                        }
                    }
                    allUrlsOfDomain.add(nextUrl);
                }
            }
        } 
        else
        {
            invalid.add(pageURL); //URL-validation fails
        }
    } 
    catch (IOException e) {
        e.printStackTrace();
    }
}

private boolean isForbiddenForCrawlers(String url){...}
private boolean urlIsValid(String url) {...}
public String popElement(Set<String> set) {...}

I know I have to run the expensive network-operation in extra threads.

Document document = Jsoup.connect(pageURL).get();  //connect to pageURL

My problem is that I have no idea how to move this operation into separate threads while keeping the sets consistent (how do I synchronize?). If possible, I want to use a ThreadPoolExecutor to control the number of threads started during the process. Does anyone have an idea how to solve this? Thanks in advance.

Answer 1

To use threads and still keep the sets consistent, give each thread its own result variable that starts out empty (null). The thread fills it in when it is done, and the result is then added to the Set.

A simple example of that could be:

Main.class

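// A String argument cannot be filled in by the thread (Java passes references by value),
// so keep the handlers and read their results once all threads are done.
Semaphore barrier = new Semaphore(0); // no permits yet; every finished thread releases one
List<WebDownloadThreadHandler> handlers = new ArrayList<>();
for (String link : links) {
    WebDownloadThreadHandler handler = new WebDownloadThreadHandler(link, barrier);
    handlers.add(handler);
    new Thread(handler).start();
}
// Block until every thread has released its permit.
barrier.acquireUninterruptibly(links.size());
for (WebDownloadThreadHandler handler : handlers) {
    if (handler.getValidUrl() != null) {
        allUrlsOfDomain.add(handler.getValidUrl());
    }
}

WebDownloadThreadHandler.class

import java.io.IOException;
import java.util.concurrent.Semaphore;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class WebDownloadThreadHandler implements Runnable {
    private final String link;
    private final Semaphore barrier;
    private String validUrl; // result; stays null until run() stores a value

    public WebDownloadThreadHandler(String link, Semaphore barrier) {
        this.link = link;
        this.barrier = barrier;
    }

    public String getValidUrl() {
        return validUrl;
    }

    @Override
    public void run() {
        try {
            // get() performs the request; userAgent() alone only returns a Connection
            Document document = Jsoup.connect(this.link).userAgent("Mozilla/5.0").get();
            Elements elements = document.select(YOUR CSS QUERY);
            /* YOUR JSOUP CODE GOES HERE, AND STORE THE VALID URL IN: this.validUrl = THE VALUE YOU GET; */
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Always release, even on failure, or the main thread waits forever.
            this.barrier.release();
        }
    }
}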

What this does is create one thread for every page whose links you want to collect, and store each result in a field of that page's handler. If you want to retrieve more than one valid link from a page, store them in a Set inside the handler and add that whole Set to the global set (addAll) once the thread is done. To keep the shared data consistent, each thread writes only to its own field (through the this keyword), and only the main thread touches the global set, after the threads have finished.
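The question also asks for a thread pool instead of one thread per link. A minimal sketch of that variant could look like the code below. It rests on a few assumptions: PooledCrawler and its fetchLinks parameter are hypothetical names that appear nowhere in the code above, and fetchLinks stands in for the Jsoup download plus the host and robots.txt filtering of a single page. The visited set becomes a concurrent set whose add() doubles as an atomic "not seen yet" check, and a counter of outstanding pages tells the calling thread when the crawl is finished. The main method uses an in-memory link map in place of real network access.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch; one instance is meant for a single crawl.
final class PooledCrawler {
    private final Set<String> seen = ConcurrentHashMap.newKeySet(); // thread-safe set
    private final AtomicInteger pending = new AtomicInteger();      // pages queued or running
    private final CountDownLatch done = new CountDownLatch(1);
    private final ExecutorService pool;
    private final Function<String, Collection<String>> fetchLinks;  // assumed page fetcher

    PooledCrawler(int nThreads, Function<String, Collection<String>> fetchLinks) {
        this.pool = Executors.newFixedThreadPool(nThreads); // caps the number of threads
        this.fetchLinks = fetchLinks;
    }

    Set<String> crawl(String entry) {
        seen.add(entry);
        submit(entry);
        try {
            done.await(); // returns once no page is outstanding
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // give up waiting, keep the flag set
        } finally {
            pool.shutdown();
        }
        return seen;
    }

    private void submit(String url) {
        pending.incrementAndGet(); // count the page before it is queued
        pool.execute(() -> {
            try {
                for (String next : fetchLinks.apply(url)) {
                    // add() returns true only for the first thread that inserts the URL
                    if (seen.add(next)) {
                        submit(next);
                    }
                }
            } finally {
                // children were counted above, so 0 means the whole crawl is finished
                if (pending.decrementAndGet() == 0) {
                    done.countDown();
                }
            }
        });
    }

    public static void main(String[] args) {
        // In-memory stand-in for a website: page -> links found on that page
        Map<String, List<String>> site = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/", "/b"),
                "/b", List.of("/c"));
        Set<String> pages = new PooledCrawler(4, url -> site.getOrDefault(url, List.of())).crawl("/");
        System.out.println(pages.size()); // prints 4
    }
}
```

For the real crawler, fetchLinks would wrap the Jsoup.connect(pageURL).get() call and the link loop from the question, returning only the URLs that pass the host and robots.txt checks. Because every worker only calls add() on the concurrent set, no explicit synchronized block is needed in this sketch.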

Hope it helps! If you need anything else feel free to ask me!

solution1
2 ACCEPTED 2018-11-08 16:30:33