
How to ignore a domain that has already been visited? Java | Jsoup

I start with a Bing search, take a few of the resulting URLs, and use them as starting points for crawling other pages: I parse the links on each page and add them to a List.

The problem is that I don't want to visit the same domain twice. I can stop the crawler from revisiting the same URL, but not from following a link to another part of a site it has already seen (such as an about page). Currently I have a LinkedList to which I add every URL I parse from a document with Jsoup, and a HashMap that stores the URLs already visited. The check is a basic "if" like this:

if(!urlsVisited.containsKey(url))
{
    urlsToVisit.add(url);
    urlsVisited.put(url, url); 
}

This runs inside a for loop that retrieves the links on each page (currently 4 threads handling 4 pages).

This stops it from adding the likes of "http://www.stackoverflow.com" twice, but it doesn't help when I come across "http://www.stackoverflow.com/questions/ask".

I would like to add one link from StackOverflow (for example) and then be done with that domain. Any ideas?

I'm using the Jsoup API in Java to parse the results.

Use the java.net.URL class to pull the host name, and use that as the key to your urlsVisited map.

http://docs.oracle.com/javase/6/docs/api/java/net/URL.html#getHost()
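As a rough sketch of that suggestion, the snippet below keys the asker's urlsVisited map by host instead of by full URL. The class name HostKeyedCrawlQueue and the methods offer and pending are made up for illustration, and lowercasing the host is an extra assumption, not something the answer asks for. Note that the URL constructors are deprecated in recent JDKs (20 and later, as far as I know), although they still work.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: wraps the question's urlsToVisit / urlsVisited pair.
class HostKeyedCrawlQueue {
    // Key: host name, value: the first URL seen for that host.
    private final Map<String, String> urlsVisited = new HashMap<>();
    private final List<String> urlsToVisit = new LinkedList<>();

    /** Queues the URL only if no URL from the same host was queued before. */
    boolean offer(String url) {
        String host;
        try {
            host = new URL(url).getHost().toLowerCase(Locale.ROOT);
        } catch (MalformedURLException e) {
            return false; // not a URL that java.net.URL can parse; skip it
        }
        if (host.isEmpty() || urlsVisited.containsKey(host)) {
            return false; // no host, or the host was already seen
        }
        urlsToVisit.add(url);
        urlsVisited.put(host, url);
        return true;
    }

    List<String> pending() {
        return urlsToVisit;
    }
}
```

With this, offering "http://www.stackoverflow.com" queues it, and a later "http://www.stackoverflow.com/questions/ask" is rejected because both share the host "www.stackoverflow.com". This version is not thread-safe, so with several crawler threads it would need external synchronization.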

You can use the URI class to parse your URLs. I also recommend using a Set<String> to store the visited domains:

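Set<String> urlsVisited = new HashSet<String>();
...

// new URI(url) throws URISyntaxException for malformed input,
// and getHost() returns null when the URL has no host part.
String domain = new URI(url).getHost();
if (domain != null && !urlsVisited.contains(domain))
{
    urlsToVisit.add(url);
    urlsVisited.add(domain);
}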
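Since the question mentions 4 threads filling the same collections, here is a minimal sketch of the Set-based check made safe for concurrent use. It assumes a concurrent set from ConcurrentHashMap.newKeySet() and relies on Set.add returning false when the element is already present, so the check and the insert happen in one step. The names DomainFilter, hostOf, offer and pending are hypothetical, and treating hosts case-insensitively is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical helper shared by all crawler threads.
class DomainFilter {
    private final Set<String> domainsVisited = ConcurrentHashMap.newKeySet();
    private final Queue<String> urlsToVisit = new ConcurrentLinkedQueue<>();

    /** Returns the lowercased host, or null if the URL is malformed or has no host. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Queues the URL if its host has not been seen yet; safe to call from several threads. */
    boolean offer(String url) {
        String host = hostOf(url);
        // add() returns true only for the first thread that inserts this host.
        if (host != null && domainsVisited.add(host)) {
            urlsToVisit.add(url);
            return true;
        }
        return false;
    }

    Queue<String> pending() {
        return urlsToVisit;
    }
}
```

One caveat: getHost() compares whole host names, so "www.stackoverflow.com" and "stackoverflow.com" count as different domains here. If they should be treated as one site, the host would need extra normalization before it goes into the set.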
