How to get all links from a website using puppeteer

Question

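I want a way to use puppeteer and a for loop to get all the links on a site and add them to an array. In this case, the links I want are not the ones in HTML tags. They are the links that appear directly in the source code: links to JavaScript files and so on. I want something like this: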

const array = [];
// `links` stands for the collection of links I am trying to obtain.
for (const link of links) {
  // The code should take all the links and add them to the array
  array.push(link);
}

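But how can I get all the references to JavaScript and style files, and all the URLs in a site's source code? I have only found one post and one question that show how to get links from tags, not all the links from the source code.

For example, suppose you want to get all the tags on this page:

view-source:https://www.nike.com/

How do I get all the script tags and print them to the console? I wrote view-source:https://nike.com because that is where you can see the script tags. I don't know whether this can be done without displaying the source code. Displaying the source and grabbing the script tags from it was simply the idea I had, but I don't know how to do it.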

Answer 1

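You can get all the links from a URL using only node.js, without puppeteer.

There are two main steps:

1. Get the source code of the URL.
2. Parse the source code for links.

A simple implementation in node.js: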

// get-links.js

///
/// Step 1: Request the URL's HTML source.
///

const axios = require('axios');

// Extract the HTML source from the response, then process it:
axios.get('https://www.nike.com')
    .then(function (response) {
        getLinksFromHtml(response.data);
    })
    .catch(function (error) {
        console.error('Request failed:', error.message);
    });

///
/// Step 2: Find links in the HTML source.
///

// This function takes HTML (as a string) and outputs all the links within it.
function getLinksFromHtml(htmlString) {
    // Regular expression that matches the syntax of a link (https://stackoverflow.com/a/3809435/117030):
    const LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi;

    // Find all the links; match() returns null when nothing matches, so fall back to an empty array:
    const matches = htmlString.match(LINK_REGEX) || [];

    // Output to console:
    console.log(matches);

    // Alternatively, return the array of links for further processing:
    return matches;
}

Example usage:

$ node get-links.js
[
    'http://www.w3.org/2000/svg',
    ...
    'https://s3.nikecdn.com/unite/scripts/unite.min.js',
    'https://www.nike.com/android-icon-192x192.png',
    ...
    'https://connect.facebook.net/',
... 658 more items
]
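Since the question asks specifically for JavaScript and style files, the array returned by getLinksFromHtml can be narrowed down afterwards. The following is a minimal sketch, not part of the original answer: filterLinksByExtension is a hypothetical helper, and the default extensions '.js' and '.css' are an assumption about what counts as a script or style file.

```javascript
// Hypothetical helper (not part of the original answer): keeps only links
// whose path ends in one of the given extensions, and removes duplicates.
function filterLinksByExtension(links, extensions = ['.js', '.css']) {
  const kept = new Set();
  for (const link of links) {
    let pathname;
    try {
      // Parse the link so that query strings and fragments are ignored.
      pathname = new URL(link).pathname.toLowerCase();
    } catch {
      continue; // Skip anything the URL parser rejects.
    }
    if (extensions.some((ext) => pathname.endsWith(ext))) {
      kept.add(link);
    }
  }
  return [...kept];
}

// Example with a few of the links from the output above:
const sample = [
  'http://www.w3.org/2000/svg',
  'https://s3.nikecdn.com/unite/scripts/unite.min.js',
  'https://www.nike.com/android-icon-192x192.png',
  'https://connect.facebook.net/',
];
console.log(filterLinksByExtension(sample));
// prints [ 'https://s3.nikecdn.com/unite/scripts/unite.min.js' ]
```

Matching on the parsed pathname rather than the raw string means a link such as `https://example.com/app.js?v=3` is still recognised as a script.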

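Notes:

I used the axios library for simplicity and to avoid "access denied" errors from nike.com. Any other method can be used to get the HTML source, for example:
- The native node.js http/https libraries
- Puppeteer (see "Get complete web page source html with puppeteer - but some part always missing")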
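As a rough illustration of the first alternative, here is a sketch that fetches the HTML with the built-in https module instead of axios. fetchHtml is a hypothetical helper, and the User-Agent header is only a guess at what a site might expect; a server such as nike.com may still answer with an "access denied" status, in which case the promise rejects.

```javascript
const https = require('https');

// Hypothetical helper: downloads a URL with the built-in https module and
// resolves with the response body as a string. It does not follow redirects.
function fetchHtml(url, headers = { 'User-Agent': 'Mozilla/5.0' }) {
  return new Promise((resolve, reject) => {
    https.get(url, { headers }, (response) => {
      if (response.statusCode !== 200) {
        response.resume(); // Discard the body so the socket is released.
        reject(new Error(`Request failed with status ${response.statusCode}`));
        return;
      }
      response.setEncoding('utf8');
      let body = '';
      response.on('data', (chunk) => { body += chunk; });
      response.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Usage (assumes getLinksFromHtml from the answer above is in scope):
// fetchHtml('https://www.nike.com').then(getLinksFromHtml).catch(console.error);
```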

Answer 2

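Although the other answers work in many situations, they do not work for client-side rendered sites. For instance, if you just make an Axios request to Reddit, all you get back is a couple of divs with some metadata. Because Puppeteer actually loads the page and runs all of its JavaScript in a real browser, the way a site chooses to render its document no longer matters for extracting page data.

Puppeteer has an evaluate method on the page object that lets you run JavaScript directly in the page. Using it, you can easily extract all the links, as follows: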

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  
  const pageUrls = await page.evaluate(() => {
    const urlArray = Array.from(document.links).map((link) => link.href);
    const uniqueUrlArray = [...new Set(urlArray)];
    return uniqueUrlArray;
  });

  console.log(pageUrls);
 
  await browser.close();
})();
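Note that document.links only covers anchor-style links in the rendered page. For the script and style files the question asks about, one possible approach is to listen to the requests the browser makes while the page loads. The sketch below is an assumption-laden illustration rather than part of the original answer: collectRequestUrls and main are hypothetical names, and the default filter of 'script' and 'stylesheet' resource types is my choice.

```javascript
// Hypothetical helper: records the URL of every request the page makes while
// loading, keeping only the given resource types (an empty list keeps all).
async function collectRequestUrls(page, url, resourceTypes = ['script', 'stylesheet']) {
  const urls = new Set();
  page.on('request', (request) => {
    if (resourceTypes.length === 0 || resourceTypes.includes(request.resourceType())) {
      urls.add(request.url());
    }
  });
  // 'networkidle2' waits until the page has mostly stopped making requests.
  await page.goto(url, { waitUntil: 'networkidle2' });
  return [...urls];
}

// Not called automatically: run main() in a project where puppeteer is installed.
async function main() {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    console.log(await collectRequestUrls(page, 'https://example.com'));
  } finally {
    await browser.close();
  }
}
```

Because the listener sees actual network traffic, it also picks up files that scripts load dynamically, which never appear as tags in the initial HTML.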

Answer 3

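Yes, you can get all the script tags and their links without opening view-source. Add the jsdom library as a dependency of your project, then pass the HTML response to a JSDOM instance, as shown below.

Here is the code: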

const axios = require('axios');
const jsdom = require('jsdom');

// await is only valid inside an async function in a CommonJS script, so wrap the code in one
(async () => {
  // make a simple HTTP request using axios or node-fetch, as you wish
  const nikePageResponse = await axios.get('https://www.nike.com');

  // now parse this response into an HTML document using the jsdom library
  const dom = new jsdom.JSDOM(nikePageResponse.data);
  const nikePage = dom.window.document;

  // now get all the script tags by querying this page
  const scriptLinks = [];
  nikePage.querySelectorAll('script[src]').forEach((script) => scriptLinks.push(script.src.trim()));
  console.debug('%o', scriptLinks);
})().catch(console.error);

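Here, I wrote a CSS selector that matches <script> tags that have a src attribute.

You can write the same code with puppeteer, but opening a browser and everything else just to get the page source takes some time.

You can use this to find the links and then do whatever you want with them, using puppeteer or anything else.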
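Values found in src or href attributes are often relative or protocol-relative, so they may need to be resolved against the page address before they can be requested. The following is a small sketch of that step using the built-in URL class. toAbsoluteUrls is a hypothetical helper, and the example URLs are invented for illustration; the raw values could be gathered with something like script.getAttribute('src').

```javascript
// Hypothetical helper: turns raw src/href values into absolute URLs,
// resolving them against the page's own URL and removing duplicates.
function toAbsoluteUrls(rawUrls, baseUrl) {
  const absolute = new Set();
  for (const raw of rawUrls) {
    try {
      absolute.add(new URL(raw.trim(), baseUrl).href);
    } catch {
      // Ignore values that cannot be parsed as a URL.
    }
  }
  return [...absolute];
}

console.log(toAbsoluteUrls(
  ['/static/app.js', '//cdn.example.com/lib.js', 'https://example.com/static/app.js'],
  'https://example.com/shop/'
));
// prints [ 'https://example.com/static/app.js', 'https://cdn.example.com/lib.js' ]
```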

Answer 1: 6 (accepted), 2021-06-05 03:27:07

Answer 2: 2, 2021-10-06 11:22:42

Answer 3: 1, 2021-06-05 23:54:03
