Capture a screenshot of a table using Puppeteer

I am learning to scrape items from a website using Puppeteer, and I am using table data from basketball-reference.com to practice. So far, my script uses Puppeteer to search for the stats of my favorite player (Stephen Curry), open the page with the tables, and take a screenshot of the page, after which it ends the scraping process and closes the browser. However, I cannot seem to scrape the table I need, and I am completely stuck.

The following is the code I have written so far:

const puppeteer = require("puppeteer");

async function run() {
  const browser = await puppeteer.launch({
    headless: false,
    ignoreHTTPSErrors: true,
  });
  const page = await browser.newPage();
  await page.goto(`https://www.basketball-reference.com/`);
  await page.waitForSelector("input[name=search]");
  await page.$eval("input[name=search]", (el) => (el.value = "Stephen Curry"));
  await page.click('input[type="submit"]');

  await page.waitForSelector(`a[href='${secondPageLink}']`, { visible: true });
  await page.click(`a[href='${secondPageLink}']`);

  await page.waitForSelector();

  await page.screenshot({
    path: `StephenCurryStats.png`,
  });
  await page.close();
  await browser.close();
}
run();

I am trying to scrape the PER GAME table at the following link and take a screenshot of it. However, I cannot find the right selector to target, and I am very confused.

The URL is https://www.basketball-reference.com/players/c/curryst01.html

There seem to be at least a couple of issues here. I'm not sure what secondPageLink refers to (it is never defined in the code shown), or what the intent is behind await page.waitForSelector(), which throws TypeError: Cannot read properties of undefined (reading 'startsWith') on my version. I would either select the first search result with .search-item-name a[href], or skip that page entirely by clicking the first autocompleted name in the search box after using page.type(). Even better, you can build the query string URL (e.g., https://www.basketball-reference.com/search/search.fcgi?search=stephen+curry) and navigate to that in your first goto.

The final page loads a video and a ton of Google ad junk. It's best to block all requests that aren't relevant to the screenshot.
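As a rough sketch of the query-string approach, the search URL can be built with the standard URL API so that spaces and special characters in the player name are encoded for you. The helper name buildSearchUrl is hypothetical; the endpoint and the search parameter are taken from the example URL above.

```javascript
// Hypothetical helper: builds the search URL for a player name.
// The endpoint and the "search" parameter come from the example URL above.
function buildSearchUrl(query) {
  const url = new URL("https://www.basketball-reference.com/search/search.fcgi");
  // searchParams uses form encoding, so a space becomes "+".
  url.searchParams.set("search", query);
  return url.href;
}

console.log(buildSearchUrl("Stephen Curry"));
// https://www.basketball-reference.com/search/search.fcgi?search=Stephen+Curry
```

The result can then be passed to the first navigation, for example await page.goto(buildSearchUrl("Stephen Curry")), in place of typing into the search box.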

const puppeteer = require("puppeteer"); // ^16.2.0

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  const url = "https://www.basketball-reference.com/";
  await page.setViewport({height: 600, width: 1300});
  await page.setRequestInterception(true);
  const allowed = [
    "https://www.basketball-reference.com",
    "https://cdn.ssref.net"
  ];
  page.on("request", request => {
    if (allowed.some(e => request.url().startsWith(e))) {
      request.continue();
    }
    else {
      request.abort();
    }
  });
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.type('input[name="search"]', "Stephen Curry");
  const $ = sel => page.waitForSelector(sel);
  await (await $(".search-results-item")).click();
  await (await $(".adblock")).evaluate(el => el.remove());
  await page.waitForNetworkIdle();
  await page.screenshot({
    path: "StephenCurryStats.png",
    fullPage: true
  });
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

If you just want to capture the per game table:

// same boilerplate above this line
await page.goto(url, {waitUntil: "domcontentloaded"});
await page.type('input[name="search"]', "Stephen Curry");
const $ = sel => page.waitForSelector(sel);
await (await $(".search-results-item")).click();
const table = await $("#per_game");
await (await page.$(".scroll_note"))?.click();
await table.screenshot({path: "StephenCurryStats.png"});

But I'd probably want a CSV, which is easier to ingest downstream:

await page.goto(url, {waitUntil: "domcontentloaded"});
await page.type('input[name="search"]', "Stephen Curry");
const $ = sel => page.waitForSelector(sel);
await (await $(".search-results-item")).click();
const btn = await page.waitForFunction(() =>
  [...document.querySelectorAll("#all_per_game-playoffs_per_game li button")]
    .find(e => e.textContent.includes("CSV"))
);
await btn.evaluate(el => el.click());
const csv = await (await $("#csv_per_game"))
  .evaluate(el => [...el.childNodes].at(-1).textContent.trim());
const table = csv.split("\n").map(e => e.split(",")); // TODO use proper CSV parser
console.log(table);
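Regarding the TODO above: splitting on commas breaks as soon as a field is quoted and contains a comma. In a real project an established CSV library is likely the better choice, but as a minimal sketch, a quote-aware parser could look like the following. The function name parseCsv is hypothetical, and this is not a complete RFC 4180 implementation.

```javascript
// Minimal quote-aware CSV parser (a sketch, not a full RFC 4180 implementation).
// Handles quoted fields, doubled quotes ("") inside quoted fields, and LF/CRLF.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = "";
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        field += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        field += ch; // commas and newlines are literal while quoted
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n" || ch === "\r") {
      if (ch === "\r" && text[i + 1] === "\n") i++; // treat CRLF as one break
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
    } else {
      field += ch;
    }
  }
  // Flush the last row when the text does not end with a line break.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}
```

With this in place, the last two lines of the snippet above could become const table = parseCsv(csv); followed by the same console.log(table).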
