Capture a screenshot of a table using Puppeteer
I am learning to scrape items from a website using Puppeteer, practicing with table data from Basketball Reference (basketball-reference.com). So far, my script uses Puppeteer to search for the stats of my favorite player (Stephen Curry), open the page with the tables, and take a screenshot of the page, after which it finishes and closes the browser. However, I cannot seem to scrape the table I need, and I am completely stuck.
The following is the code I have written so far:
const puppeteer = require("puppeteer");

async function run() {
  const browser = await puppeteer.launch({
    headless: false,
    ignoreHTTPSErrors: true,
  });
  const page = await browser.newPage();
  await page.goto(`https://www.basketball-reference.com/`);
  await page.waitForSelector("input[name=search]");
  await page.$eval("input[name=search]", (el) => (el.value = "Stephen Curry"));
  await page.click('input[type="submit"]');
  await page.waitForSelector(`a[href='${secondPageLink}']`, { visible: true });
  await page.click(`a[href='${secondPageLink}']`);
  await page.waitForSelector();
  await page.screenshot({
    path: `StephenCurryStats.png`,
  });
  await page.close();
  await browser.close();
}

run();
I am trying to scrape the PER GAME table on the following page and take a screenshot of it. However, I cannot find the right selector to use, and I am very confused.
The URL is https://www.basketball-reference.com/players/c/curryst01.html
There seem to be at least a couple of issues here. I'm not sure what secondPageLink refers to, or the intent behind await page.waitForSelector() with no arguments (it throws TypeError: Cannot read properties of undefined (reading 'startsWith') on my version). I would either select the first search result with .search-item-name a[href], or skip that page entirely by clicking on the first autocompleted name in the search box after using page.type(). Even better, you can build the query string URL (e.g. https://www.basketball-reference.com/search/search.fcgi?search=stephen+curry) and navigate to that in your first goto.
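As a rough sketch of the query-string approach, the search URL can be built with the standard URL API so that the player name is encoded correctly. The helper name buildSearchUrl is hypothetical; the endpoint path is simply copied from the example URL above.

```javascript
// Hypothetical helper: builds the search URL for a player name.
// The endpoint path is taken from the example URL in the answer.
function buildSearchUrl(name) {
  const url = new URL("https://www.basketball-reference.com/search/search.fcgi");
  url.searchParams.set("search", name); // spaces are serialized as "+"
  return url.toString();
}

const searchUrl = buildSearchUrl("Stephen Curry");
console.log(searchUrl);
```

The result could then be passed to the first page.goto(searchUrl, {waitUntil: "domcontentloaded"}) call, followed by a click on the first .search-item-name a[href] result.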
The final page loads a video and a ton of Google ad junk. It's best to block all requests that aren't relevant to the screenshot:
const puppeteer = require("puppeteer"); // ^16.2.0

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  const url = "https://www.basketball-reference.com/";
  await page.setViewport({height: 600, width: 1300});
  await page.setRequestInterception(true);
  const allowed = [
    "https://www.basketball-reference.com",
    "https://cdn.ssref.net"
  ];
  page.on("request", request => {
    if (allowed.some(e => request.url().startsWith(e))) {
      request.continue();
    }
    else {
      request.abort();
    }
  });
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.type('input[name="search"]', "Stephen Curry");
  const $ = sel => page.waitForSelector(sel);
  await (await $(".search-results-item")).click();
  await (await $(".adblock")).evaluate(el => el.remove());
  await page.waitForNetworkIdle();
  await page.screenshot({
    path: "StephenCurryStats.png",
    fullPage: true
  });
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
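One possible refinement of the request filter above: a plain prefix check would also let through a look-alike host such as https://www.basketball-reference.com.example.org, so the check can compare parsed origins instead. This is only a sketch; isAllowed and allowedOrigins are hypothetical names, and the two origins are the same ones used in the allowed array.

```javascript
// Origins allowed to load; same two hosts as in the snippet above.
const allowedOrigins = new Set([
  "https://www.basketball-reference.com",
  "https://cdn.ssref.net",
]);

// Hypothetical helper: true if the request URL's origin is on the allowlist.
function isAllowed(requestUrl) {
  try {
    return allowedOrigins.has(new URL(requestUrl).origin);
  } catch {
    return false; // not a parseable absolute URL
  }
}

console.log(isAllowed("https://cdn.ssref.net/some/script.js"));
console.log(isAllowed("https://www.basketball-reference.com.example.org/"));
```

Inside the request handler this would replace the allowed.some(...) condition, for example: isAllowed(request.url()) ? request.continue() : request.abort().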
If you just want to capture the per game table:
// same boilerplate above this line
await page.goto(url, {waitUntil: "domcontentloaded"});
await page.type('input[name="search"]', "Stephen Curry");
const $ = sel => page.waitForSelector(sel);
await (await $(".search-results-item")).click();
const table = await $("#per_game");
await (await page.$(".scroll_note"))?.click();
await table.screenshot({path: "StephenCurryStats.png"});
But I'd probably want a CSV for maximum ingestion:
await page.goto(url, {waitUntil: "domcontentloaded"});
await page.type('input[name="search"]', "Stephen Curry");
const $ = sel => page.waitForSelector(sel);
await (await $(".search-results-item")).click();
const btn = await page.waitForFunction(() =>
  [...document.querySelectorAll("#all_per_game-playoffs_per_game li button")]
    .find(e => e.textContent.includes("CSV"))
);
await btn.evaluate(el => el.click());
const csv = await (await $("#csv_per_game"))
  .evaluate(el => [...el.childNodes].at(-1).textContent.trim());
const table = csv.split("\n").map(e => e.split(",")); // TODO use proper CSV parser
console.log(table);
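For the TODO above, a small hand-rolled parser is one option if you would rather not pull in a CSV library. The following is a minimal sketch, not a full RFC 4180 implementation: it handles quoted fields, doubled quotes inside quoted fields, and commas or newlines inside quotes. The name parseCsv and the sample string are made up for illustration.

```javascript
// Minimal CSV parser sketch: handles quoted fields, escaped quotes ("")
// and commas or newlines inside quotes. Not a full RFC 4180 implementation.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = "";
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        field += '"'; // doubled quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n") {
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
    } else if (ch !== "\r") {
      field += ch; // ignore carriage returns from CRLF line endings
    }
  }
  // Flush the last field and row if the text does not end with a newline.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Illustrative input only, not data from the site.
console.log(parseCsv('name,note\n"Curry, Stephen","said ""hi"""'));
```

With this in place, the last two lines of the snippet above would become const table = parseCsv(csv); followed by console.log(table);.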