
What is the best way to check each link of a website?

I want to create a crawler that follows every link on a site and checks each URL to see whether it works. At the moment my code opens each URL using url.openStream() .

So what is the best way to create a crawler?

Use an HTML parser like Jsoup.

import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Keep track of URLs that have already been checked so each one is requested only once.
Set<String> validLinks = new HashSet<String>();
Set<String> invalidLinks = new HashSet<String>();

// Fetch and parse the page, then collect all anchor elements.
Document document = Jsoup.connect("http://example.com").get();
Elements links = document.select("a");

for (Element link : links) {
    // absUrl() resolves the href against the page's base URL, so relative links work too.
    String url = link.absUrl("href");

    if (!validLinks.contains(url) && !invalidLinks.contains(url)) {
        try {
            // ignoreHttpErrors(true) stops Jsoup from throwing on 4xx/5xx responses,
            // so the status code check below is actually reached for broken links.
            int statusCode = Jsoup.connect(url).ignoreHttpErrors(true).execute().statusCode();

            if (200 <= statusCode && statusCode < 400) {
                validLinks.add(url);
            } else {
                invalidLinks.add(url);
            }
        } catch (Exception e) {
            // Malformed URLs, timeouts, and non-HTTP schemes (mailto:, javascript:) end up here.
            invalidLinks.add(url);
        }
    }
}

You may want to send a HEAD request inside that loop instead of a full GET to make it more efficient, but then you'll have to use URLConnection, because Jsoup by design doesn't support HEAD (a HEAD response returns no content to parse).
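As a rough illustration, a HEAD-based check with HttpURLConnection might look like the sketch below. The isReachable helper name and the 5-second timeouts are my own choices for the example, not part of the original code:

import java.net.HttpURLConnection;
import java.net.URL;

// Returns true if the URL answers a HEAD request with a 2xx/3xx status.
static boolean isReachable(String url) {
    try {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("HEAD");          // ask for headers only, no body
        connection.setConnectTimeout(5000);
        connection.setReadTimeout(5000);
        connection.setInstanceFollowRedirects(true);  // follow redirects to the final target
        int statusCode = connection.getResponseCode();
        connection.disconnect();
        return 200 <= statusCode && statusCode < 400;
    } catch (Exception e) {
        return false;  // malformed URL, timeout, refused connection, etc.
    }
}

You could then call isReachable(url) in place of the Jsoup.connect(url).execute() call inside the loop, keeping the same validLinks/invalidLinks bookkeeping. Note that some servers don't handle HEAD well and may answer with 405, so you may want to fall back to a GET for those URLs.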

