How to get all links from a website with puppeteer

Question

I would like to use Puppeteer and a for loop to collect all the links on a site and add them to an array. The links I want are not the ones in the HTML tags; they are the links that appear directly in the source code, such as JavaScript file links. I want something like this:

const array = [];
for (const link of links) {
  // The code should take every link and add it to the array
  array.push(link);
}

But how can I get all references to JavaScript and style files, and all URLs that appear in the source code of a website? I have only found posts and questions that show how to get links from tags, not all the links in the source code.

Supposing you want to get all the tags on this page for example:

view-source: https://www.nike.com/

How can I get all the script tags and print them to the console? I used view-source:https://nike.com because the script tags are visible there. I don't know whether this can be done without displaying the source code; displaying it and then extracting the script tags was the only idea I had, but I don't know how to do it.

Answer 1

It is possible to get all links from a URL using only Node.js, without Puppeteer.

There are two main steps:

1. Get the source code for the URL.
2. Parse the source code for links.

Simple implementation in node.js:

// get-links.js

///
/// Step 1: Request the URL's HTML source.
///

const axios = require('axios');

// Extract the HTML source from the response, then process it:
axios.get('https://www.nike.com')
    .then(function (response) {
        const htmlSource = response.data;
        getLinksFromHtml(htmlSource);
    })
    .catch(function (error) {
        console.error('Request failed:', error.message);
    });

///
/// Step 2: Find links in the HTML source.
///

// This function takes HTML (as a string) and returns all the links within it.
function getLinksFromHtml(htmlString) {
    // Regular expression that matches the syntax of a link (https://stackoverflow.com/a/3809435/117030):
    const LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi;

    // Use the regular expression above to find all the links
    // (match returns null when nothing is found, so fall back to an empty array):
    const matches = htmlString.match(LINK_REGEX) || [];

    // Output to console:
    console.log(matches);

    // Also return the array of links for further processing:
    return matches;
}

Sample usage:

$ node get-links.js
[
    'http://www.w3.org/2000/svg',
    ...
    'https://s3.nikecdn.com/unite/scripts/unite.min.js',
    'https://www.nike.com/android-icon-192x192.png',
    ...
    'https://connect.facebook.net/',
... 658 more items
]
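
Since the question is specifically about JavaScript and style files, the array returned by getLinksFromHtml can be narrowed down afterwards. The following is only a minimal sketch, assuming the matches are plain URL strings; filterAssetLinks and its default extension list are hypothetical names chosen for illustration and are not part of the answer above:

```javascript
// Hypothetical helper: keep only unique links that point at .js or .css files.
function filterAssetLinks(links, extensions = ['.js', '.css']) {
    // Tolerate a null input (String.prototype.match returns null on no match).
    const unique = [...new Set(links || [])];

    return unique.filter(function (link) {
        try {
            // Compare against the path only, so query strings such as ?v=2 are ignored.
            const pathname = new URL(link).pathname.toLowerCase();
            return extensions.some((ext) => pathname.endsWith(ext));
        } catch (err) {
            // Skip anything that is not a parseable absolute URL.
            return false;
        }
    });
}

// Sample input taken from the output shown above, with one duplicate added.
const sample = [
    'http://www.w3.org/2000/svg',
    'https://s3.nikecdn.com/unite/scripts/unite.min.js',
    'https://s3.nikecdn.com/unite/scripts/unite.min.js',
    'https://www.nike.com/android-icon-192x192.png',
];

console.log(filterAssetLinks(sample));
// prints [ 'https://s3.nikecdn.com/unite/scripts/unite.min.js' ]
```

Filtering on the URL path rather than on the raw string avoids missing versioned files such as app.js?v=2, at the cost of skipping scripts served from paths without a file extension.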

Notes:

I used the axios library for simplicity and to avoid "access denied" errors from nike.com. It is possible to use any other method to get the HTML source, like:
- Native node.js http/https libraries
- Puppeteer (see "Get complete web page source html with puppeteer - but some part always missing")
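
For the first alternative, a rough sketch using only the built-in https module might look like the following. fetchHtml is a hypothetical helper name; the sketch assumes the server answers with status 200 directly (it does not follow redirects) and sends no custom headers, so sites that reject non-browser clients may still return "access denied":

```javascript
const https = require('https');

// Hypothetical helper: fetch a page's HTML source with the built-in https module.
function fetchHtml(url) {
    return new Promise(function (resolve, reject) {
        const request = https.get(url, function (response) {
            if (response.statusCode !== 200) {
                // Discard the body so the socket is released, then report the failure.
                response.resume();
                reject(new Error('Request failed with status ' + response.statusCode));
                return;
            }

            // Accumulate the body as UTF-8 text until the response ends.
            response.setEncoding('utf8');
            let body = '';
            response.on('data', (chunk) => { body += chunk; });
            response.on('end', () => resolve(body));
        });

        // Network-level errors (DNS failure, refused connection, ...) end up here.
        request.on('error', reject);
    });
}

// Usage (performs a real network request, so it is left commented out):
// fetchHtml('https://www.nike.com').then(getLinksFromHtml).catch(console.error);
```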

Answer 2

Although the other answers are applicable in many situations, they will not work for client-side rendered sites. For instance, if you just make an Axios request to Reddit, all you'll get is a couple of divs with some metadata. Because Puppeteer loads the page and executes its JavaScript in a real browser, the website's choice of rendering strategy becomes irrelevant for extracting page data.

Puppeteer has an evaluate method on the page object which allows you to run JavaScript directly on the page. Using that, you can easily extract all links as follows:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  
  const pageUrls = await page.evaluate(() => {
    const urlArray = Array.from(document.links).map((link) => link.href);
    const uniqueUrlArray = [...new Set(urlArray)];
    return uniqueUrlArray;
  });

  console.log(pageUrls);
 
  await browser.close();
})();
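
Note that document.links only covers anchor and area elements. Since the question asks for script and stylesheet URLs, one possible variation is to listen to the page's 'request' event and record the URLs the browser actually fetches. This is a sketch rather than a definitive implementation: createAssetCollector is a hypothetical helper, https://example.com is a placeholder URL, and waiting for 'networkidle2' is an assumption about when the page has finished loading its assets:

```javascript
// Hypothetical helper: builds a 'request' listener that records script and stylesheet URLs.
function createAssetCollector(types = ['script', 'stylesheet']) {
    const urls = new Set();
    return {
        // Called by Puppeteer for every request the page issues.
        onRequest(request) {
            if (types.includes(request.resourceType())) {
                urls.add(request.url());
            }
        },
        // Returns the unique URLs collected so far, in the order they were first seen.
        getUrls: () => [...urls],
    };
}

async function main() {
    // Loaded lazily so the helper above can be reused without Puppeteer installed.
    const puppeteer = require('puppeteer');
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        const collector = createAssetCollector();

        // Register the listener before navigating so early requests are not missed.
        page.on('request', collector.onRequest);
        await page.goto('https://example.com', { waitUntil: 'networkidle2' });

        console.log(collector.getUrls());
    } finally {
        await browser.close();
    }
}

// main().catch(console.error); // uncomment to run against a real page
```

Unlike querying the DOM, this approach also picks up scripts that other scripts inject at runtime, because it observes network traffic rather than markup.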

Answer 3

Yes, you can get all the script tags and their links without opening view-source. Add the jsdom library as a dependency in your project, then pass the HTML response to a JSDOM instance as shown below.

Here is the code:

const axios = require('axios');
const jsdom = require('jsdom');

// await is only valid inside an async function in a CommonJS script, so wrap the logic in one
async function getScriptLinks() {
    // Make a simple HTTP request using axios (or node-fetch, if you prefer)
    const nikePageResponse = await axios.get('https://www.nike.com');

    // Parse the response into an HTML document using the jsdom library
    const dom = new jsdom.JSDOM(nikePageResponse.data);
    const nikePage = dom.window.document;

    // Get all the script tags that have a src attribute and collect their links
    const scriptLinks = [];
    nikePage.querySelectorAll('script[src]').forEach((script) => scriptLinks.push(script.src.trim()));
    return scriptLinks;
}

getScriptLinks()
    .then((scriptLinks) => console.debug('%o', scriptLinks))
    .catch(console.error);

Here I have used a CSS selector that matches <script> tags with a src attribute.

You can write the same code using Puppeteer, but it takes extra time to open the browser and load the page before you can get its source.

You can use this to find the links and then process them however you like, with Puppeteer or anything else.
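
One thing to watch for: src values are often written relative to the page (for example /static/app.js or //cdn.example.com/lib.js), and depending on how the document was created they may not come back as full URLs. A small sketch of resolving them with the standard URL class follows; toAbsoluteUrls is a hypothetical helper and the sample paths are made up for illustration:

```javascript
// Hypothetical helper: resolve raw src attribute values against the page URL.
function toAbsoluteUrls(srcValues, baseUrl) {
    return srcValues
        .map((src) => src.trim())
        .filter((src) => src.length > 0)
        // new URL(relative, base) leaves absolute URLs untouched and resolves the rest.
        .map((src) => new URL(src, baseUrl).href);
}

const resolved = toAbsoluteUrls(
    ['/static/app.js', '//cdn.example.com/lib.js', 'https://example.org/x.js'],
    'https://www.nike.com'
);

console.log(resolved);
// prints:
// [
//   'https://www.nike.com/static/app.js',
//   'https://cdn.example.com/lib.js',
//   'https://example.org/x.js'
// ]
```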

How to get all links from a website with puppeteer

Question

3 answers

solution1
6 ACCEPTED 2021-06-05 03:27:07

solution2
2 2021-10-06 11:22:42

solution3
1 2021-06-05 23:54:03
