How can I use jsoup to get the navigable links to a site's pages?

Question

I am implementing a basic crawler for later use in a vulnerability scanner. I am using jsoup to connect to the site and to retrieve and parse the HTML documents.

I manually supply the base/root URL of the target site (www.example.com) and connect.

...
Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
Document htmlDocument = connection.get();
this.htmlDocument = htmlDocument;
...

Then I retrieve all the links on the page.

...
Elements linksOnPage = htmlDocument.select("a[href]");
...

After this I loop over the links and try to collect the links to all the pages on the site.

for (Element link : linksOnPage) {
    this.links.add(link.absUrl("href"));
}

The problem is as follows. Depending on the links I get, some might not be links to new pages, or not even links to pages at all. For example, I got links like:

https://example.example.com/webmail
http://193.231.21.13
mailto:example.example@exampl.com

What I need help with is filtering the links so that I get only links to new pages on the same root/base site.

Answer 1

This is easy. Check that absUrl starts with the site's base URL and does not end with an image, js, or css extension:

if (absUrl.startsWith("http://www.ics.uci.edu/") && !absUrl.matches(".*\\.(bmp|gif|jpg|png|js|css)$")) {
    // here absUrl starts with the site's base URL and is not an image, js, or css file
}
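A prefix check like this is tied to one scheme and one exact base string, so here is a rough sketch of the same idea done by comparing hosts with java.net.URI instead. The class name LinkFilter, both method names, and the baseHost parameter are made up for illustration and are not part of jsoup; the extension list is the one from the condition above. Treat it as a starting point under those assumptions, not a complete crawler filter.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of jsoup: keeps only http(s) links on one host.
final class LinkFilter {

    // True when absUrl is an http(s) URL on baseHost whose path does not
    // end in one of the static-asset extensions from the answer.
    static boolean isSameSitePage(String absUrl, String baseHost) {
        URI uri;
        try {
            uri = new URI(absUrl);
        } catch (URISyntaxException e) {
            return false; // skip links that cannot be parsed
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return false; // mailto: links and empty strings have no host
        }
        if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) {
            return false;
        }
        if (!host.equalsIgnoreCase(baseHost)) {
            return false; // other subdomains and bare IP addresses end up here
        }
        String path = uri.getPath() == null ? "" : uri.getPath().toLowerCase(Locale.ROOT);
        return !path.matches(".*\\.(bmp|gif|jpg|png|js|css)$");
    }

    // Filters a list of absolute URLs and drops the fragment, so that
    // page#a and page#b are stored once. The Set removes repeats.
    static Set<String> filter(List<String> absUrls, String baseHost) {
        Set<String> pages = new LinkedHashSet<>();
        for (String url : absUrls) {
            if (isSameSitePage(url, baseHost)) {
                int hash = url.indexOf('#');
                pages.add(hash >= 0 ? url.substring(0, hash) : url);
            }
        }
        return pages;
    }
}
```

Inside the loop from the question this would be used as `if (LinkFilter.isSameSitePage(link.absUrl("href"), "www.example.com")) { ... }`. Whether subdomains such as example.example.com should count as the same site is a design decision; this sketch treats them as different hosts.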

Answer 1 posted 2017-06-13 15:54:26
