
How can I get all website links recursively?

I need to write code that will recursively get all the links in a website. Since I'm new to this, here is what I've got so far:

List<WebElement> no = driver.findElements(By.tagName("a"));
int nooflinks = no.size();
for (WebElement pagelink : no)
{
    String linktext = pagelink.getText();
    String link = pagelink.getAttribute("href");
}

Now what I need is: if the list finds a link on the same domain, it should get all the links from that URL, then return to the previous loop and resume from the next link. This should continue until the last URL in the whole website is found. For example, say the home page is the base URL and it has 5 URLs to other pages; after getting the first of the 5 URLs, the loop should collect all the links of that first URL, return to the home page, and resume from the second URL. If the second URL has sub-sub-URLs, the loop should find links for those first, then resume to the second URL, and finally go back to the home page and resume from the third URL.
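For illustration, a minimal sketch of that traversal (depth-first: descend into each same-domain link, then resume where you left off) might look like the following. It assumes a WebDriver field named driver is in scope; the visited set and the sameDomain helper are assumptions added for this example, not part of my code:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;

Set<String> visited = new HashSet<>();

void crawl(String url) {
    if (!visited.add(url)) return;            // skip pages we've already seen
    driver.get(url);
    // Copy hrefs out first: navigating away makes the WebElements stale.
    List<String> hrefs = new ArrayList<>();
    for (WebElement a : driver.findElements(By.tagName("a"))) {
        String href = a.getAttribute("href");
        if (href != null) hrefs.add(href);
    }
    for (String href : hrefs) {
        if (sameDomain(href)) {               // hypothetical same-domain check
            crawl(href);                      // descend, then resume here
        }
    }
}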

Can anybody help me out here?

I saw this post recently. I don't know whether you are still looking for a solution to this problem; in case you are, I thought this might be useful:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class URLReading {
    public static void main(String[] args) {
        try {
            String url = "https://abidsukumaran.wordpress.com/";
            // Maps each discovered URL to its link text; doubles as the "visited" set.
            HashMap<String, String> h = new HashMap<>();
            Document doc = Jsoup.connect(url).get();

            // Page title
            String title = doc.title();

            // Queue of pages still to be crawled, starting with the root.
            List<String> url_array = new ArrayList<>();
            int i = 0;
            url_array.add(url);
            h.put(url, title);

            while (i < url_array.size()) {
                try {
                    url = url_array.get(i);
                    doc = Jsoup.connect(url).get();
                    Elements links = doc.select("a[href]");

                    for (Element link : links) {
                        // putIfAbsent returns null only for URLs not seen before.
                        String res = h.putIfAbsent(link.attr("href"), link.text());
                        if (res == null) {
                            url_array.add(link.attr("href"));
                            System.out.println("\nURL: " + link.attr("href"));
                            System.out.println("CONTENT: " + link.text());
                        }
                    }
                } catch (Exception e) {
                    System.out.println("\n" + e);
                }
                i++;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
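In short, this is a breadth-first crawl: url_array acts as a FIFO queue of pages still to fetch, and putIfAbsent on the HashMap ensures each URL is enqueued at most once, so the loop terminates once no new links turn up. Note that it follows every href it encounters, external domains included; to stay within one site you would filter candidate links against the base URL first.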

You can use Set and HashSet. You may try something like this:

Set<String> getLinksFromSite(int level, Set<String> links) {
    if (level < 5) {                       // cap the recursion depth at 5
        Set<String> locallinks = new HashSet<String>();
        for (String link : links) {
            // Fetch the links found on this page (see the sketch below).
            Set<String> newLinks = getLinksFromPage(link);
            locallinks.addAll(getLinksFromSite(level + 1, newLinks));
        }
        return locallinks;
    } else {
        return links;
    }
}
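The answer leaves the page-fetching step blank. As one possibility, here is a sketch of the getLinksFromPage helper using Jsoup; the helper name is an assumption for this example:

import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper: fetches a page and returns the absolute hrefs on it.
static Set<String> getLinksFromPage(String url) {
    Set<String> links = new HashSet<>();
    try {
        for (Element a : Jsoup.connect(url).get().select("a[href]")) {
            links.add(a.attr("abs:href")); // abs:href resolves relative URLs
        }
    } catch (Exception e) {
        // Unreachable or malformed pages simply contribute no links.
    }
    return links;
}

One caveat with the depth-cap approach: without a shared visited set, pages that link to each other are re-fetched at every level up to the cap, so a production version would also track visited URLs.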

I would think the following idiom would be useful in this context:

Set<String> visited = new HashSet<>();
Deque<String> unvisited = new LinkedList<>();

unvisited.add(startingURL);
while (!unvisited.isEmpty()) {
    String current = unvisited.poll();
    visited.add(current);
    for (/* each link in current */) {
        if (!visited.contains(link.url())) {
            unvisited.add(link.url());
        }
    }
}
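Fleshed out, that idiom might look like the following sketch, using Jsoup to fetch and parse each page; the library choice, the placeholder start URL, and the same-domain filter are assumptions added for the example:

import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class Crawler {
    public static void main(String[] args) {
        String startingURL = "https://example.com/";   // placeholder root URL
        Set<String> visited = new HashSet<>();
        Deque<String> unvisited = new LinkedList<>();

        unvisited.add(startingURL);
        while (!unvisited.isEmpty()) {
            String current = unvisited.poll();
            if (!visited.add(current)) continue;       // already processed
            try {
                for (Element a : Jsoup.connect(current).get().select("a[href]")) {
                    String link = a.attr("abs:href");  // resolve relative URLs
                    // Assumed same-domain filter; drop it to follow all links.
                    if (link.startsWith(startingURL) && !visited.contains(link)) {
                        unvisited.add(link);
                    }
                }
                System.out.println(current);
            } catch (Exception e) {
                // Skip pages that fail to load or parse.
            }
        }
    }
}

Using an explicit queue instead of recursion means arbitrarily deep sites cannot overflow the stack, and the visited set guarantees termination even when pages link to each other in cycles.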
