
How to get all links from a website with puppeteer

Well, I would like a way to use puppeteer and a for loop to get all the links on a site and add them to an array. In this case, the links I want are not only the ones in HTML tags; they are links that appear directly in the source code, such as JavaScript file links. I want something like this:

const array = [];
for (const link of links) {
  // The code should take every link and add it to the array
  array.push(link);
}

But how can I get all references to JavaScript and style files, and all URLs that are in the source code of a website? I have only found posts and questions that show how to get the links from HTML tags, not all the links from the source code.

Suppose, for example, you want to get all the tags on this page:

view-source:https://www.nike.com/

How can I get all the script tags and print them to the console? I put view-source:https://www.nike.com because that lets you see the script tags. I don't know if it can be done without displaying the source code, but displaying the source and reading the script tags from it was the only idea I had, and I don't know how to implement it.

It is possible to get all links from a URL using only node.js, without puppeteer:

There are two main steps:

  1. Get the source code for the URL.
  2. Parse the source code for links.

A simple implementation in node.js:

// get-links.js

///
/// Step 1: Request the URL's html source.
///

const axios = require('axios');
const promise = axios.get('https://www.nike.com');

// Extract the HTML source from the response, then process it:
promise.then(function (response) {
    const htmlSource = response.data;
    getLinksFromHtml(htmlSource);
});

///
/// Step 2: Find links in HTML source.
///

// This function takes HTML (as a string) and outputs all the links within it.
function getLinksFromHtml(htmlString) {
    // Regular expression that matches the syntax of a link (https://stackoverflow.com/a/3809435/117030):
    const LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi;

    // Use the regular expression from above to find all the links:
    const matches = htmlString.match(LINK_REGEX);

    // Output to console:
    console.log(matches);

    // Alternatively, return the array of links for further processing:
    return matches;
}

Sample usage:

$ node get-links.js
[
    'http://www.w3.org/2000/svg',
    ...
    'https://s3.nikecdn.com/unite/scripts/unite.min.js',
    'https://www.nike.com/android-icon-192x192.png',
    ...
    'https://connect.facebook.net/',
... 658 more items
]
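
A note on the extraction step: `String.prototype.match` with the `g` flag returns every occurrence, duplicates included, so it is usually worth deduplicating the result. Below is a minimal, self-contained sketch of the same regex applied to a hardcoded HTML snippet (the snippet and its URLs are just illustrative samples, not a real page):

```javascript
// Apply the link-matching regex to an inline HTML string, so the
// extraction step can be demonstrated without a network request.
const LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi;

const sampleHtml = `
  <script src="https://s3.nikecdn.com/unite/scripts/unite.min.js"></script>
  <link rel="icon" href="https://www.nike.com/android-icon-192x192.png">
  <a href="https://www.nike.com/android-icon-192x192.png">icon</a>
`;

// match() returns every occurrence, duplicates included:
const matches = sampleHtml.match(LINK_REGEX) || [];

// Deduplicate with a Set:
const uniqueLinks = [...new Set(matches)];

console.log(uniqueLinks);
// → [ 'https://s3.nikecdn.com/unite/scripts/unite.min.js',
//     'https://www.nike.com/android-icon-192x192.png' ]
```

The same deduplication can be applied to the `matches` array inside `getLinksFromHtml` before returning it.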

Notes:

Although the other answers are applicable in many situations, they will not work for client-side rendered sites. For instance, if you just make an Axios request to Reddit, all you'll get is a couple of divs with some metadata. Because Puppeteer actually loads the page and runs all its JavaScript in a real browser, the website's choice of rendering method becomes irrelevant for extracting page data.

Puppeteer has an evaluate method on the page object which allows you to run JavaScript directly on the page. Using that, you can easily extract all links as follows:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  
  const pageUrls = await page.evaluate(() => {
    const urlArray = Array.from(document.links).map((link) => link.href);
    const uniqueUrlArray = [...new Set(urlArray)];
    return uniqueUrlArray;
  });

  console.log(pageUrls);
 
  await browser.close();
})();
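
Once the `pageUrls` array is back in Node, it can be post-processed with the built-in `URL` class, for example to split same-origin links from third-party ones. A sketch using a hypothetical sample array in place of a real Puppeteer result:

```javascript
// Split a list of collected links into same-origin and third-party URLs,
// using Node's built-in URL class. `pageUrls` stands in for the array
// returned by page.evaluate() above (sample data, not a real crawl).
const pageUrls = [
  'https://example.com/about',
  'https://example.com/store/shoes',
  'https://cdn.example.net/static/app.min.js',
];

const pageOrigin = 'https://example.com';
const internalLinks = pageUrls.filter((url) => new URL(url).origin === pageOrigin);
const externalLinks = pageUrls.filter((url) => new URL(url).origin !== pageOrigin);

console.log(internalLinks); // → the two example.com links
console.log(externalLinks); // → [ 'https://cdn.example.net/static/app.min.js' ]
```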

Yes, you can get all the script tags and their links without opening view-source. You need to add a dependency on the jsdom library to your project, and then pass the HTML response to a jsdom instance, as below.

Here is the code:

const axios = require('axios');
const jsdom = require("jsdom");

(async () => {
  // Make a simple HTTP request using axios (or node-fetch, as you wish)
  const nikePageResponse = await axios.get('https://www.nike.com');

  // Now parse this response into an HTML document using the jsdom library
  const dom = new jsdom.JSDOM(nikePageResponse.data);
  const nikePage = dom.window.document;

  // Now get all the script tags by querying this document
  const scriptLinks = [];
  nikePage.querySelectorAll('script[src]').forEach((script) => scriptLinks.push(script.src.trim()));
  console.debug('%o', scriptLinks);
})();

Here I have used a CSS selector for <script> tags that have a src attribute.

You can write the same code using puppeteer, but it will take extra time to launch the browser, load everything, and then get the page source.
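
If you do want the Puppeteer variant, one convenient pattern is to keep the in-page extraction logic in a standalone function that `page.evaluate` can serialize into the browser; the same function can then be exercised against any DOM-like object. A sketch (the mock document below is purely illustrative):

```javascript
// In-page extraction logic as a standalone function. Inside a browser,
// `doc` defaults to the page's global `document`; outside one, any object
// with a querySelectorAll method will do.
function collectScriptSrcs(doc = document) {
  return Array.from(doc.querySelectorAll('script[src]')).map((script) => script.src.trim());
}

// With Puppeteer the function is serialized into the page context
// (requires `npm install puppeteer`, so it is only shown commented out):
//
//   const scriptLinks = await page.evaluate(collectScriptSrcs);

// A minimal mock document demonstrates the function without a browser:
const mockDocument = {
  querySelectorAll: (selector) =>
    selector === 'script[src]'
      ? [{ src: ' https://cdn.example.com/app.min.js ' }]
      : [],
};

console.log(collectScriptSrcs(mockDocument));
// → [ 'https://cdn.example.com/app.min.js' ]
```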

You can use this to find the links and then do whatever you want with them, using puppeteer or anything else.
