
Get all outbound links from a website using JavaScript

I am working on developing Chrome extensions and am relatively new to this field. The project I am currently working on requires my extension to scan all the "webpages" present on the current "website" (that is, the website currently open in the active tab of the browser). I need to get and print a list of all outbound links from the website (and not just the currently open webpage).

Progress so far: Using the Chrome tabs API, I have managed to get a list of all the outbound links from the currently active webpage. I fetch the active tab's URL and then, using the query functions and a small script based on document.links, I have been able to do this successfully for a single page.
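For context, a minimal sketch of that single-page step (assuming a Manifest V2 extension with the tabs permission; chrome.tabs.executeScript is the MV2 injection API):

chrome.tabs.query({ active: true, currentWindow: true }, function (tabs)
{
    // Inject a snippet into the active tab that reads document.links
    chrome.tabs.executeScript(tabs[0].id, {
        code: 'Array.prototype.map.call(document.links, function (a) { return a.href; })'
    }, function (results)
    {
        console.log(results[0]); // array of hrefs collected from the page
    });
});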

Problem: I need to convert this into an iterative solution: scan all the links from the current page, hit these links one by one, repeat the process for each of them, and add the newly found links to the existing list.

I understand this is not a trivial problem and basically need some guidance on the approach I should use. I haven't been able to fetch the links silently, without opening them in a new tab, and I need a way to do this. It would be great if someone could guide me through this. Thanks!

I wouldn't bother implementing this yourself when it has been done before. You could try the solutions from this SO question, for example, to gather all the links:

How to find all links / pages on a website

As mentioned in the comments, XHR (XMLHttpRequest) did the trick! Here is the code I am using now (hopefully it can help anyone else with a similar problem):

var allLinks = []; // set of all internal and external links found so far

function httpGet(theUrl)
{
    // Use a local request object so overlapping requests don't clobber each other
    var xmlHttp = new XMLHttpRequest();
    xmlHttp.onreadystatechange = function () { processRequest(xmlHttp); };
    xmlHttp.open("GET", theUrl, true);
    xmlHttp.send(null);
}

function processRequest(xmlHttp)
{
    if (xmlHttp.readyState == 4 && xmlHttp.status == 200)
    {
        // Parse the fetched HTML off-screen and collect every anchor element
        var container = document.createElement("p");
        container.innerHTML = xmlHttp.responseText;
        var anchors = container.getElementsByTagName("a");
        for (var i = 0; i < anchors.length; i++)
        {
            var href = anchors[i].href;
            if (allLinks.indexOf(href) == -1) // skip duplicates
            {
                allLinks.push(href);
                document.getElementById('printLinks').innerHTML += href + "<br />";
            }
        }
    }
}

This does the job well; this way I can fetch and analyse each URL from the list and keep adding newly found URLs.
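To turn this into the iterative scan described in the question, here is a minimal sketch of a queue-based crawl (the queue/visited names and the same-origin check are my additions; it reuses the allLinks array from above, restricts recursion to the current site so the crawl terminates, and cross-origin XHR from an extension still needs matching host permissions in the manifest):

var queue = [];    // URLs waiting to be fetched
var visited = {};  // URLs already fetched

function crawl(startUrl)
{
    queue.push(startUrl);
    crawlNext();
}

function crawlNext()
{
    if (queue.length == 0) return; // crawl finished
    var url = queue.shift();
    if (visited[url]) { crawlNext(); return; }
    visited[url] = true;

    var xhr = new XMLHttpRequest();
    xhr.onreadystatechange = function ()
    {
        if (xhr.readyState == 4)
        {
            if (xhr.status == 200)
            {
                var container = document.createElement("p");
                container.innerHTML = xhr.responseText;
                var anchors = container.getElementsByTagName("a");
                for (var i = 0; i < anchors.length; i++)
                {
                    var href = anchors[i].href;
                    if (allLinks.indexOf(href) == -1)
                        allLinks.push(href);
                    // only follow links on the same site, so the crawl stays bounded
                    if (href.indexOf(location.origin) == 0 && !visited[href])
                        queue.push(href);
                }
            }
            crawlNext(); // move on to the next queued URL
        }
    };
    xhr.open("GET", url, true);
    xhr.send(null);
}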

Courtesy: StackOverflow questions and other blogs :)

Just apply a filter to the script below and you will be good to go. I might update this answer when I get some time.

// Extracting the inbound and outbound links on the page
var links = document.querySelectorAll('a');
// Use an index loop: for..in on a NodeList also iterates non-numeric
// properties (length, item, ...), which would log undefined hrefs
for (var i = 0; i < links.length; i++)
    console.log(links[i].href);
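For example, a hypothetical filter that separates inbound (same-host) from outbound links, assuming the script runs in the page's own context:

var links = document.querySelectorAll('a');
var inbound = [], outbound = [];
for (var i = 0; i < links.length; i++)
{
    if (!links[i].href) continue; // skip anchors without an href
    if (links[i].hostname === location.hostname)
        inbound.push(links[i].href);  // same host as the current page
    else
        outbound.push(links[i].href); // points to another site
}
console.log('Inbound:', inbound, 'Outbound:', outbound);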
