
How to scrape a new page using puppeteer?

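I am trying to scrape Reddit using puppeteer and Node.js. Here is my code, in which I:

  1. Open a page for the Reddit home page.
  2. Get all the posts.
  3. For each post, get the link to its content page.
  4. Open a new page for each content page.
  5. Scrape each content page.

const puppeteer = require("puppeteer");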

const self = {
  browser: null,
  page: null,

  initialize: async () => {
    browser = await puppeteer.launch({
      headless: false,
    });
    page = await browser.newPage();

    // Go to the index page of Reddit
    await page.goto("https://old.reddit.com/", { waitUntil: "networkidle0" });
  },

  getResults: async () => {
    let platform = "Reddit";

    // Get all posts on the main page of Reddit.
    let mentions = await page.$$('#siteTable > div[class *= "thing"]');
    let results = [];

    // For each post:
    for (let mention of mentions) {
      let content = "";

      // I get the link to its content page.
      let content_URL = await mention.$eval(
        'p[class="title"] > a[class*="title"]',
        (node) => node.getAttribute("href").trim()
      );

      // If it is an internal link:
      if (content_URL.startsWith("/r/")) {

        // Create a new page to open that content page. 
        let contentPage = await browser.newPage();
        await contentPage.goto("https://old.reddit.com" + content_URL, {
          waitUntil: "networkidle0",
        });

        // Get the first paragraph of this content page.
        content = await contentPage.evaluate((contentPage) => {
          
          // Here is where the error occurred: 
          // Error: Evaluation failed: TypeError: Cannot read property 'querySelector' of undefined
          let firstParagraph = contentPage.querySelector(
            'div[class*="usertext-body"] > p'
          );

          if (firstParagraph != null) {
            return firstParagraph.innerText.trim();
          } else {
            return null;
          }
        });
      }

      results.push({
        title,
        content,
        image,
        date,
        popularity,
        platform,
      });
    }

    return results;
  },
};

module.exports = self;

But an error occurred: Error: Evaluation failed: TypeError: Cannot read property 'querySelector' of undefined

Can anyone point out what I did wrong?

Thanks!

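page.evaluate basically executes code in the browser's context, i.e. it is the same as typing that code into the browser's developer console. Inside the callback, contentPage is only a parameter name, and since no argument is passed after the function, it is undefined. So in this case you probably want to use document.querySelector() rather than a reference to the undefined contentPage: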

let firstParagraph = document.querySelector(
  'div[class*="usertext-body"] > p'
);
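For context, here is a minimal sketch of how the whole step might look with that fix applied. It assumes `browser` is the object returned by `puppeteer.launch()`; the names `scrapeFirstParagraph` and `SELECTOR` are made up for illustration and are not from the original code. It also shows that a value from the Node side (here the selector string) can be handed to the callback as an extra argument to `evaluate()`.

```javascript
// Hypothetical helper: returns the first paragraph of a post, or null.
// `browser` is assumed to come from puppeteer.launch().
const SELECTOR = 'div[class*="usertext-body"] > p';

async function scrapeFirstParagraph(browser, url) {
  const contentPage = await browser.newPage();
  try {
    await contentPage.goto(url, { waitUntil: "networkidle0" });

    // The callback runs inside the browser, so it reads from `document`.
    // Node-side values must be passed as extra arguments to evaluate();
    // the selector string arrives here as `selector`.
    return await contentPage.evaluate((selector) => {
      const firstParagraph = document.querySelector(selector);
      return firstParagraph ? firstParagraph.innerText.trim() : null;
    }, SELECTOR);
  } finally {
    // Close the tab so open tabs do not pile up across the loop.
    await contentPage.close();
  }
}
```

Inside the loop from the question, this could be called as `content = await scrapeFirstParagraph(browser, "https://old.reddit.com" + content_URL);`. Closing each tab in `finally` is a design choice of this sketch, not something the original code does.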

