
How can I get all website links recursively?

I need to write code that will recursively get all the links in a website. Since I'm new to this, here is what I've got so far:

List<WebElement> no = driver.findElements(By.tagName("a"));
int nooflinks = no.size();
for (WebElement pagelink : no)
{
    String linktext = pagelink.getText();
    String link = pagelink.getAttribute("href");
}

Now what I need is: if the list finds a link on the same domain, it should get all the links from that URL, then return to the previous loop and resume from the next link. This should continue until the last URL in the whole website is found. For example, say the home page is the base URL and it has 5 URLs to other pages; after getting the first of the 5 URLs, the loop should collect all the links of that first URL, return to the home page, and resume from the second URL. If the second URL has sub-sub-URLs, the loop should find links for those first, then resume to the second URL, and finally go back to the home page and resume from the third URL.
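For illustration, a minimal sketch of that traversal (depth-first: descend into each same-domain link, then resume where you left off) might look like the following. It assumes a WebDriver field named driver is in scope; the visited set and the sameDomain helper are assumptions added for this example, not part of my code:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;

Set<String> visited = new HashSet<>();

void crawl(String url) {
    if (!visited.add(url)) return;            // skip pages we've already seen
    driver.get(url);
    // Copy hrefs out first: navigating away makes the WebElements stale.
    List<String> hrefs = new ArrayList<>();
    for (WebElement a : driver.findElements(By.tagName("a"))) {
        String href = a.getAttribute("href");
        if (href != null) hrefs.add(href);
    }
    for (String href : hrefs) {
        if (sameDomain(href)) {               // hypothetical same-domain check
            crawl(href);                      // descend, then resume here
        }
    }
}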

Can anybody help me out here?

I saw this post recently. I don't know whether you are still looking for a solution to this problem; in case you are, I thought this might be useful:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class URLReading {
    public static void main(String[] args) {
        try {
            String url = "https://abidsukumaran.wordpress.com/";
            // Maps each discovered URL to its link text; doubles as the "visited" set.
            HashMap<String, String> h = new HashMap<>();
            Document doc = Jsoup.connect(url).get();

            // Page title
            String title = doc.title();

            // Queue of pages still to be crawled, starting with the root.
            List<String> url_array = new ArrayList<>();
            int i = 0;
            url_array.add(url);
            h.put(url, title);

            while (i < url_array.size()) {
                try {
                    url = url_array.get(i);
                    doc = Jsoup.connect(url).get();
                    Elements links = doc.select("a[href]");

                    for (Element link : links) {
                        // putIfAbsent returns null only for URLs not seen before.
                        String res = h.putIfAbsent(link.attr("href"), link.text());
                        if (res == null) {
                            url_array.add(link.attr("href"));
                            System.out.println("\nURL: " + link.attr("href"));
                            System.out.println("CONTENT: " + link.text());
                        }
                    }
                } catch (Exception e) {
                    System.out.println("\n" + e);
                }
                i++;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
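In short, this is a breadth-first crawl: url_array acts as a FIFO queue of pages still to fetch, and putIfAbsent on the HashMap ensures each URL is enqueued at most once, so the loop terminates once no new links turn up. Note that it follows every href it encounters, external domains included; to stay within one site you would filter candidate links against the base URL first.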

You can use Set and HashSet. You may try something like this:

Set<String> getLinksFromSite(int level, Set<String> links) {
    if (level < 5) {                       // cap the recursion depth at 5
        Set<String> locallinks = new HashSet<String>();
        for (String link : links) {
            // Fetch the links found on this page (see the sketch below).
            Set<String> newLinks = getLinksFromPage(link);
            locallinks.addAll(getLinksFromSite(level + 1, newLinks));
        }
        return locallinks;
    } else {
        return links;
    }
}
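The answer leaves the page-fetching step blank. As one possibility, here is a sketch of the getLinksFromPage helper using Jsoup; the helper name is an assumption for this example:

import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper: fetches a page and returns the absolute hrefs on it.
static Set<String> getLinksFromPage(String url) {
    Set<String> links = new HashSet<>();
    try {
        for (Element a : Jsoup.connect(url).get().select("a[href]")) {
            links.add(a.attr("abs:href")); // abs:href resolves relative URLs
        }
    } catch (Exception e) {
        // Unreachable or malformed pages simply contribute no links.
    }
    return links;
}

One caveat with the depth-cap approach: without a shared visited set, pages that link to each other are re-fetched at every level up to the cap, so a production version would also track visited URLs.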

I would think the following idiom would be useful in this context:

Set<String> visited = new HashSet<>();
Deque<String> unvisited = new LinkedList<>();

unvisited.add(startingURL);
while (!unvisited.isEmpty()) {
    String current = unvisited.poll();
    visited.add(current);
    for (/* each link in current */) {
        if (!visited.contains(link.url())) {
            unvisited.add(link.url());
        }
    }
}
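Fleshed out, that idiom might look like the following sketch, using Jsoup to fetch and parse each page; the library choice, the placeholder start URL, and the same-domain filter are assumptions added for the example:

import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class Crawler {
    public static void main(String[] args) {
        String startingURL = "https://example.com/";   // placeholder root URL
        Set<String> visited = new HashSet<>();
        Deque<String> unvisited = new LinkedList<>();

        unvisited.add(startingURL);
        while (!unvisited.isEmpty()) {
            String current = unvisited.poll();
            if (!visited.add(current)) continue;       // already processed
            try {
                for (Element a : Jsoup.connect(current).get().select("a[href]")) {
                    String link = a.attr("abs:href");  // resolve relative URLs
                    // Assumed same-domain filter; drop it to follow all links.
                    if (link.startsWith(startingURL) && !visited.contains(link)) {
                        unvisited.add(link);
                    }
                }
                System.out.println(current);
            } catch (Exception e) {
                // Skip pages that fail to load or parse.
            }
        }
    }
}

Using an explicit queue instead of recursion means arbitrarily deep sites cannot overflow the stack, and the visited set guarantees termination even when pages link to each other in cycles.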
