
How to recursively get all the links present in a website using Javascript?

I'm looking for a way to recursively find all the links present on any given website. I know how to do this in Java, but I don't know how it can be done using JavaScript.

Suppose a website has a directory structure like the one below. If we provide 'www.abc.com', then it should return the following output.


www.abc.com/images
www.abc.com/files
www.abc.com/images/a.jpg
www.abc.com/images/b.jpg
www.abc.com/files/aa.txt
www.abc.com/files/bb.txt

Since the question is tagged jQuery, I'll use that. Simply target the a tags.

var linksList = [];

// Record a link if we haven't seen it yet, then follow it.
function addLink(url){
    if (url !== "" && linksList.indexOf(url) === -1) {
        linksList.push(url);
        scrapePage(url);
    }
}

// Fetch a page and collect every link found in its body.
function scrapePage(url){
    $.get(url, function(html){
        // Render the fetched HTML inside a temporary iframe
        // so it stays isolated from the current page.
        var $iframe = $('<iframe>').appendTo('body');
        $iframe.contents().find("body").html(html);
        $iframe.contents().find("body a").each(function(index, link){
            addLink(link.href);
        });
        $iframe.remove();
    });
}

// Seed the crawl with the links on the current page.
$("body a").each(function(index, link){
    addLink(link.href);
});

Pretty simple: one function adds links to our list, another follows the links we add. I decided to put the content of each scraped page inside an iframe to keep everything contained...

You'll want to add your own logic to make sure it only takes links from the same domain. You may also need to normalize each URL, since it will not always be absolute (my code assumes it is). And so on; a sketch of that filtering is shown below.
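For example, here is a minimal sketch of that check, assuming the crawl starts from the current page so that location.origin is the target domain (the sameDomainUrl helper is hypothetical, not part of the answer's code):

// Hypothetical helper: resolve a possibly-relative href and
// accept it only when it belongs to the current site's origin.
function sameDomainUrl(href) {
    // The URL constructor resolves relative hrefs against a base URL.
    var url = new URL(href, location.href);
    return url.origin === location.origin ? url.href : null;
}

// Inside addLink, you would then do something like:
// var absolute = sameDomainUrl(url);
// if (absolute !== null && linksList.indexOf(absolute) === -1) { ... }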

I don't think you can get all the links of an entire website, but you can get all the links of a particular page, like below:

var allLinks = document.getElementsByTagName("a");
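For instance, a short sketch of reading the collected anchors (getElementsByTagName returns a live HTMLCollection, and each anchor's href property is already resolved to an absolute URL):

// Walk the HTMLCollection and collect each anchor's absolute URL.
var hrefs = [];
for (var i = 0; i < allLinks.length; i++) {
    hrefs.push(allLinks[i].href);
}
console.log(hrefs);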

Hope it helps. It would be great if you could elaborate on your issue a bit more.
