How to get navigable links to pages from a site using jsoup?

I am implementing a basic crawler for later use in a vulnerability scanner. I am using jsoup to connect to, retrieve, and parse the HTML documents.

I manually supply the base/root of the intended site (www.example.com) and connect.

...
Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
Document htmlDocument = connection.get();
this.htmlDocument = htmlDocument;
...

Then I retrieve all the links on the page.

...
Elements linksOnPage = htmlDocument.select("a[href]");
... 

After this I loop over the links and try to collect the links to all the pages on the site.

for (Element link : linksOnPage) {
    this.links.add(link.absUrl("href"));
}

The problem is as follows. Depending on the links I get, some might not be links to new pages, or not even links to pages at all. As an example, I got links like:

What I need help with is filtering the links so that I get only links to new pages of the same root/base site.

This is easy. Check that absUrl starts with the site's base URL and does not end with an image extension, .js, or .css:

if (absUrl.startsWith("http://www.ics.uci.edu/")
        && !absUrl.matches(".*\\.(bmp|gif|jpg|png|js|css)$")) {
    // here absUrl starts with the domain name and is not an image, js, or css file
}
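
The same check can be extended a little so that it also skips non-HTTP links (mailto:, javascript:), ignores letter case in the extension, and drops the #fragment so that two anchors on one page do not count as two pages. The following is only a rough sketch under some assumptions: the class name LinkFilter, the method normalize, and the parameter rootHost are made up for illustration, the extension list is just the one from above plus jpeg, and it uses only java.net.URI rather than anything from jsoup.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
final class LinkFilter {

    // Extensions treated as non-page resources (an assumption; extend as needed).
    private static final Pattern RESOURCE =
            Pattern.compile(".*\\.(bmp|gif|jpe?g|png|js|css)$");

    /**
     * Returns the URL without its fragment if it is an http(s) link on
     * rootHost that does not look like an image/js/css file, else null.
     */
    static String normalize(String absUrl, String rootHost) {
        // jsoup's absUrl("href") returns "" when no absolute URL can be built.
        if (absUrl == null || absUrl.isEmpty()) {
            return null;
        }
        // Drop the fragment: /about#team and /about are the same page.
        int hash = absUrl.indexOf('#');
        String candidate = hash >= 0 ? absUrl.substring(0, hash) : absUrl;

        URI uri;
        try {
            uri = new URI(candidate);
        } catch (URISyntaxException e) {
            return null; // malformed link, skip it
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return null; // e.g. mailto: or javascript: links have no host
        }
        if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) {
            return null;
        }
        if (!host.equalsIgnoreCase(rootHost)) {
            return null; // link leaves the site
        }
        // Test only the path, so a query string does not hide the extension.
        String path = uri.getPath() == null ? "" : uri.getPath();
        if (RESOURCE.matcher(path.toLowerCase(Locale.ROOT)).matches()) {
            return null;
        }
        return candidate;
    }
}
```

In the loop from the question this would be called as LinkFilter.normalize(link.absUrl("href"), "www.example.com"), adding the result to the collection only when it is not null. If the collection is a Set<String> rather than a List, links to pages that were already seen are dropped automatically.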
