How to scrape a new page using puppeteer?
I'm trying to scrape Reddit using puppeteer and Node.js. Here is my code so far:
const puppeteer = require("puppeteer");

const self = {
  browser: null,
  page: null,

  initialize: async () => {
    browser = await puppeteer.launch({
      headless: false,
    });
    page = await browser.newPage();
    // Go to the index page of Reddit
    await page.goto("https://old.reddit.com/", { waitUntil: "networkidle0" });
  },

  getResults: async () => {
    let platform = "Reddit";
    // Get all posts on the main page of Reddit.
    let mentions = await page.$$('#siteTable > div[class *= "thing"]');
    let results = [];
    // For each post:
    for (let mention of mentions) {
      let content = "";
      // I get the link to its content page.
      let content_URL = await mention.$eval(
        'p[class="title"] > a[class*="title"]',
        (node) => node.getAttribute("href").trim()
      );
      // If it is an inner link:
      if (content_URL.substr(0, 3) === "/r/") {
        // Create a new page to open that content page.
        let contentPage = await browser.newPage();
        await contentPage.goto("https://old.reddit.com" + content_URL, {
          waitUntil: "networkidle0",
        });
        // Get the first paragraph of this content page.
        content = await contentPage.evaluate((contentPage) => {
          // Here is where the error occurred:
          // Error: Evaluation failed: TypeError: Cannot read property 'querySelector' of undefined
          let firstParagraph = contentPage.querySelector(
            'div[class*="usertext-body"] > p'
          );
          if (firstParagraph != null) {
            return firstParagraph.innerText.trim();
          } else {
            return null;
          }
        });
      }
      results.push({
        title,
        content,
        image,
        date,
        popularity,
        platform,
      });
    }
    return results;
  },
};

module.exports = self;
But an error occurred: Error: Evaluation failed: TypeError: Cannot read property 'querySelector' of undefined
Can anyone point out what I'm doing wrong?
Thanks!
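page.evaluate essentially executes code in the context of the browser, i.e. it is the same thing you would type into the browser's developer console to get the same result. The callback's contentPage parameter is undefined because no argument was passed to evaluate after the function, and a Puppeteer page object is not available inside the browser anyway. So in this case you probably want to use document.querySelector() instead of the reference to the undefined contentPage: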
let firstParagraph = document.querySelector(
'div[class*="usertext-body"] > p'
);
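To put the fix in context, here is a minimal sketch of how that lookup could be wrapped in a helper. The name getFirstParagraph and its selector parameter are assumptions made for illustration and are not part of the original code. It relies only on page.evaluate, which forwards any serializable arguments listed after the callback into the browser context:

```javascript
// Hypothetical helper (not from the original post): `page` is any Puppeteer
// Page, such as the `contentPage` opened in the question.
async function getFirstParagraph(page, selector) {
  // The callback runs inside the browser, so it can only see browser globals
  // such as `document`, plus the serializable arguments passed after it.
  return page.evaluate((sel) => {
    const firstParagraph = document.querySelector(sel);
    // Return the trimmed text, or null when no matching element exists.
    return firstParagraph ? firstParagraph.innerText.trim() : null;
  }, selector);
}
```

In the question's loop this would be called as content = await getFirstParagraph(contentPage, 'div[class*="usertext-body"] > p'). Calling await contentPage.close() afterwards is probably also worthwhile, so that one tab per post is not left open.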