
Recursively retrieving links from web page in Java

I'm working on a simplified website downloader (programming assignment) and I have to recursively go through the links in the given URL and download the individual pages to my local directory.

I already have a function, Set<String> retrieveLinksOnPage(URL url), that retrieves all the hyperlinks (href attributes) from a single page. This function returns a set of hyperlinks. I have been told to download pages up to level 4 (level 0 being the home page). Therefore I basically want to retrieve all the links in the site, but I'm having difficulty coming up with the recursion algorithm. In the end, I intend to call my function like this:

retrieveAllLinksFromSite("http://www.example.com/ldsjf.html",0)

Set<String> Links = new HashSet<String>(); // Set is an interface, so it needs a concrete implementation

Set<String> retrieveAllLinksFromSite(URL url, int Level, Set<String> Links)
{
    if (Level == 4)
        return Links; // the method returns Set<String>, so a bare `return` won't compile
    else {
        // retrieveLinksOnPage(url);
        // I'm pretty lost actually!
        return Links; // placeholder so the method compiles
    }
}

Thanks!

Here is the pseudo code:

Set<String> retrieveAllLinksFromSite(int Level, Set<String> Links) {
    if (Level < 5) {
        Set<String> local_links = new HashSet<String>();
        for (String link : Links) {
            // download the page at `link`
            Set<String> new_links = retrieveLinksOnPage(new URL(link)); // parse the downloaded HTML of `link`
            local_links.addAll(retrieveAllLinksFromSite(Level + 1, new_links));
        }
        return local_links;
    } else {
        return Links;
    }
}

You will need to implement the things in the comments yourself. To run the function from a single starting link, create an initial set of links that contains only that one link. However, it also works if you have multiple initial links.
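To make the recursion above concrete without network I/O, here is a minimal sketch. The page fetcher is injected as a Map<String, Set<String>> standing in for the asker's retrieveLinksOnPage(URL); the class and method names other than retrieveAllLinksFromSite are illustrative, and a visited set is added so cyclic links don't recurse forever:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Crawler {
    private final Map<String, Set<String>> site;   // url -> links on that page (fake fetcher)
    private final Set<String> visited = new HashSet<>();

    Crawler(Map<String, Set<String>> site) { this.site = site; }

    // Collect every link reachable from `url`, descending at most `maxLevel`
    // levels (level 0 = home page), skipping pages already visited.
    Set<String> retrieveAllLinksFromSite(String url, int level, int maxLevel) {
        Set<String> result = new HashSet<>();
        if (level > maxLevel || !visited.add(url)) {
            return result;                         // depth limit hit, or page seen before
        }
        Set<String> linksOnPage = site.getOrDefault(url, Set.of());
        result.addAll(linksOnPage);
        for (String link : linksOnPage) {
            result.addAll(retrieveAllLinksFromSite(link, level + 1, maxLevel));
        }
        return result;
    }
}
```

In a real assignment the `site` lookup would be replaced by downloading the page and calling the existing retrieveLinksOnPage on it; the visited set matters because two pages on a site commonly link to each other.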

Set<String> initial_link_set = new HashSet<String>();
initial_link_set.add("http://abc.com/");
Set<String> final_link_set = retrieveAllLinksFromSite(1, initial_link_set);

You can use a HashMap instead to store the links together with their levels (since you need to recursively get all links down to level 4).

Also, it would be something like this (just giving an overall hint):

Map<String, Integer> Links = new HashMap<String, Integer>(); // url -> level it was found at

void retrieveAllLinksFromSite(URL url, int Level)
{
    if (Level == 4)
        return;
    else {
        // retrieve the links on the current page, and for each retrieved link:
        //     download the link
        //     Links.put(the retrieved url, Level);                     // store the link with its level
        //     retrieveAllLinksFromSite(the retrieved url, Level + 1);  // recurse into further levels
    }
}
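The HashMap hint can be sketched concretely as follows. As in the pseudocode, each URL is recorded with the level it was found at; checking the map before recursing also stops the crawl from revisiting pages. The in-memory `pages` map is an assumed stand-in for downloading and parsing a page, and the class name is illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class LevelCrawler {
    private final Map<String, Set<String>> pages;       // url -> links on that page (fake fetcher)
    final Map<String, Integer> links = new HashMap<>(); // url -> level it was found at

    LevelCrawler(Map<String, Set<String>> pages) { this.pages = pages; }

    void retrieveAllLinksFromSite(String url, int level) {
        if (level == 4)
            return;                                     // stop: only levels 0..3 spawn new links
        for (String link : pages.getOrDefault(url, Set.of())) {
            if (!links.containsKey(link)) {             // skip links already recorded
                links.put(link, level);                 // "download" the link and record its level
                retrieveAllLinksFromSite(link, level + 1);
            }
        }
    }
}
```

The containsKey check is what the map buys you over a plain collection: it doubles as the visited test and as the record of how deep each page sits.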
