How to get all links from a website with puppeteer

I would like to use Puppeteer and a for loop to collect all the links on a site into an array. The links I want are not the ones in HTML tags; they are the links that appear directly in the source code, such as JavaScript file links. I want something like this:

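const array = [];
// `links` stands for whatever collection of URLs gets extracted from the site
for (const link of links) {
  // The code should take all the links and add these links to the array
  array.push(link);
}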

But how can I get all references to JavaScript and style files, and all URLs that appear in the source code of a website? I have only found posts and questions that show how to get links from tags, not all the links in the source code.

Suppose you want to get all the tags on this page, for example:

view-source:https://www.nike.com/

How can I get all script tags and print them to the console? I used view-source:https://nike.com because the script tags are visible there. I don't know whether this can be done without displaying the source code; displaying it and reading the script tags was the only idea I had, but I do not know how to do it.

It is possible to get all links from a URL using only Node.js, without Puppeteer.

There are two main steps:

  1. Get the source code for the URL.
  2. Parse the source code for links.

Simple implementation in node.js:

// get-links.js

///
/// Step 1: Request the URL's HTML source.
///

const axios = require('axios');
const promise = axios.get('https://www.nike.com');

// Extract the HTML source from the response, then process it:
promise.then(function (response) {
    const htmlSource = response.data;
    getLinksFromHtml(htmlSource);
}).catch(function (error) {
    // Report network or HTTP errors instead of leaving the rejection unhandled
    console.error('Request failed:', error.message);
});

///
/// Step 2: Find links in the HTML source.
///

// This function takes HTML (as a string) and outputs all the links within.
function getLinksFromHtml(htmlString) {
    // Regular expression that matches the syntax of a link (https://stackoverflow.com/a/3809435/117030):
    const LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi;

    // Use the regular expression above to find all the links
    // (match() returns null when nothing matches, so fall back to an empty array):
    const matches = htmlString.match(LINK_REGEX) || [];

    // Output to console:
    console.log(matches);

    // Alternatively, return the array of links for further processing:
    return matches;
}

Sample usage:

$ node get-links.js
[
    'http://www.w3.org/2000/svg',
    ...
    'https://s3.nikecdn.com/unite/scripts/unite.min.js',
    'https://www.nike.com/android-icon-192x192.png',
    ...
    'https://connect.facebook.net/',
... 658 more items
]
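
Since the question is specifically about JavaScript and style file links, the array printed above can be narrowed down afterwards. The following is a minimal sketch and not part of the original answer: the helper name filterLinksByExtension and its default extension list are assumptions made for illustration. It removes duplicates and keeps only the URLs whose path ends in one of the given extensions, using Node's built-in URL class.

```javascript
// Hypothetical helper (not from the original answer): keeps unique links whose
// path ends with one of the given extensions.
function filterLinksByExtension(links, extensions = ['.js', '.css']) {
    const kept = new Set();
    for (const link of links) {
        let pathname;
        try {
            // Look only at the path, so query strings such as ?v=2 are ignored
            pathname = new URL(link).pathname.toLowerCase();
        } catch (error) {
            continue; // skip strings that are not valid absolute URLs
        }
        if (extensions.some((ext) => pathname.endsWith(ext))) {
            kept.add(link);
        }
    }
    return [...kept];
}

// Example with a few links taken from the sample output above:
const assetLinks = filterLinksByExtension([
    'http://www.w3.org/2000/svg',
    'https://s3.nikecdn.com/unite/scripts/unite.min.js',
    'https://www.nike.com/android-icon-192x192.png',
    'https://s3.nikecdn.com/unite/scripts/unite.min.js',
    'https://connect.facebook.net/',
]);
console.log(assetLinks);
// prints [ 'https://s3.nikecdn.com/unite/scripts/unite.min.js' ]
```

Matching on the file extension is only a heuristic: scripts served from URLs without a .js suffix would be missed.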

Notes:

Although the other answers work in many situations, they will not work for client-side rendered sites. For instance, if you make a plain Axios request to Reddit, all you get back is a couple of divs with some metadata. Because Puppeteer loads the page and executes its JavaScript in a real browser, the website's choice of rendering strategy becomes irrelevant for extracting page data.

Puppeteer has an evaluate method on the page object that lets you run JavaScript directly in the page. Using it, you can easily extract all links as follows:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  
  const pageUrls = await page.evaluate(() => {
    const urlArray = Array.from(document.links).map((link) => link.href);
    const uniqueUrlArray = [...new Set(urlArray)];
    return uniqueUrlArray;
  });

  console.log(pageUrls);
 
  await browser.close();
})();
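
Note that document.links only covers anchor and area elements, so it does not include the script and stylesheet files the question asks about. One possible way to capture those with Puppeteer is to record the requests the page makes while it loads. The sketch below is an illustration under assumptions, not the original answerer's code: the function names pickResourceUrls and collectRequests, the TARGET_URL environment variable, and the choice of the networkidle2 wait condition are all made up for this example.

```javascript
// Hypothetical helper: keeps unique URLs whose resource type is in `wanted`.
function pickResourceUrls(requests, wanted = ['script', 'stylesheet']) {
  const urls = requests
    .filter((request) => wanted.includes(request.type))
    .map((request) => request.url);
  return [...new Set(urls)];
}

// Hypothetical helper: loads the page and records every request it makes.
async function collectRequests(targetUrl) {
  const puppeteer = require('puppeteer'); // loaded only when a page is actually fetched
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    const seen = [];
    // The 'request' event fires for each resource the page asks for
    page.on('request', (request) => {
      seen.push({ url: request.url(), type: request.resourceType() });
    });
    await page.goto(targetUrl, { waitUntil: 'networkidle2' });
    return seen;
  } finally {
    await browser.close();
  }
}

// Assumed convention for this sketch: run only when TARGET_URL is set, e.g.
//   TARGET_URL=https://example.com node collect-resources.js
if (process.env.TARGET_URL) {
  collectRequests(process.env.TARGET_URL)
    .then((requests) => console.log(pickResourceUrls(requests)))
    .catch((error) => console.error(error));
}
```

This lists only the resources that were requested during the page load; files loaded later, for example after user interaction, would not appear.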

Yes, you can get all the script tags and their links without opening view-source. Add the jsdom library as a dependency in your project, then pass the HTML response to a JSDOM instance, as shown below.

Here is the code:

const axios = require('axios');
const jsdom = require('jsdom');

// Top-level await is not available in CommonJS scripts, so wrap the calls in an async function
(async () => {
  // make a simple HTTP request using axios or node-fetch, as you wish
  const nikePageResponse = await axios.get('https://www.nike.com');

  // now parse this response into an HTML document using the jsdom library
  const dom = new jsdom.JSDOM(nikePageResponse.data);
  const nikePage = dom.window.document;

  // now get all the script tags by querying this page
  const scriptLinks = [];
  nikePage.querySelectorAll('script[src]').forEach((script) => scriptLinks.push(script.src.trim()));
  console.debug('%o', scriptLinks);
})().catch(console.error);

Here I have written a CSS selector that matches <script> tags that have a src attribute.
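
The src values collected this way may be relative paths (for example /static/app.js) or protocol-relative URLs. As a minimal sketch, assuming you know the URL the page was fetched from, they can be turned into absolute URLs with Node's built-in URL class. The function name resolveScriptUrls and the sample paths are hypothetical and chosen only for illustration.

```javascript
// Hypothetical helper: resolves script src values against the page URL.
function resolveScriptUrls(srcValues, pageUrl) {
  const resolved = [];
  for (const src of srcValues) {
    try {
      // new URL(relative, base) handles absolute, relative and protocol-relative inputs
      resolved.push(new URL(src, pageUrl).href);
    } catch (error) {
      // skip values that cannot be parsed as a URL
    }
  }
  return resolved;
}

// Example with made-up src values:
const absoluteScriptLinks = resolveScriptUrls(
  ['/static/app.js', '//cdn.example.com/lib.js', 'https://other.example/x.js', 'js/local.js'],
  'https://www.nike.com/us/'
);
console.log(absoluteScriptLinks);
```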

You can write the same code using Puppeteer, but it takes extra time to launch the browser and load the page before you can read its source.

You can use this to find the links and then do whatever you want with them, using Puppeteer or anything else.

