Using Puppeteer to Scrape a Header Row and a Table of Values

I would like to scrape the following data table from a website:

<body style="background-color:grey;">
  <div class="table" id="myTable" style="display: table;">
    <div class="tr" style="background-color: #4CAF50; color: white;">
      <div class="td tnic">Nickname</div>
      <div class="td tsrv">Server IP</div>
      <div class="td tip">IP</div>
      <div class="td treg">Region</div>
      <div class="td tcou">Country</div>
      <div class="td tcit">City</div>
      <div class="td tscr">Score <input type="checkbox" onchange="mysrt(this)" id="chkscr"></div>
      <div class="td tupd">Update Time <input type="checkbox" onchange="mysrt(this)" id="chkupd" checked="" disabled="">
      </div>
      <div class="td taut">Auth Key</div>
      <div class="td town">Key Owner</div>
      <div class="td tver">Version</div>
      <div class="td tdet">Details</div>
    </div>
    <div class="tr mytarget ">
      <div class="td tnic">Player 1</div>
      <div class="td tsrv">_GAME_MENU_</div>
      <div class="td tip">x.x.226.35</div>
      <div class="td treg">North America</div>
      <div class="td tcou">United States</div>
      <div class="td tcit">Cleveland</div>
      <div class="td tscr">21</div>
      <div class="td tupd">2022-12-29 10:17:01 (GMT-8)</div>
      <div class="td taut">SecretauthK3y</div>
      <div class="td town">CoolName</div>
      <div class="td tver">7.11</div>
      <div class="td tdet">FPS: 93 @ 0(0) ms @ 0 K/m</div>
    </div>
    <div class="tr mytarget ">
      <div class="td tnic">PlayerB</div>
      <div class="td tsrv">_GAME_MENU_</div>
      <div class="td tip">x.x.90.221</div>
      <div class="td treg">North America</div>
      <div class="td tcou">United States</div>
      <div class="td tcit">Mechanicsville</div>
      <div class="td tscr">67991</div>
      <div class="td tupd">2022-12-29 10:16:56 (GMT-8)</div>
      <div class="td taut">SecretauthK3y2</div>
      <div class="td town">PlayerB</div>
      <div class="td tver">7.12</div>
      <div class="td tdet">FPS: 50 @ 175(243) ms @ 0 K/m</div>
    </div>
    <div class="tr mytarget ">
      <div class="td tnic">McChicken</div>
      <div class="td tsrv">_GAME_MENU_</div>
      <div class="td tip">x.x.39.80</div>
      <div class="td treg">North America</div>
      <div class="td tcou">United States</div>
      <div class="td tcit"></div>
      <div class="td tscr">0</div>
      <div class="td tupd">2022-12-29 09:41:44 (GMT-8)</div>
      <div class="td taut">SecretauthK3y3</div>
      <div class="td town">SOLO KEY</div>
      <div class="td tver">7.12</div>
      <div class="td tdet">FPS: 63 @ 0(0) ms @ 0 K/m</div>
    </div>
  </div>

It has a header row under .tr, and each row of data is represented by a div with the classes tr and mytarget (.tr.mytarget). Normally there are hundreds more .tr.mytarget rows, all in the same format as the three shown. My goal is to scrape this data in a way that makes it easy to run calculations and filters on it. It will eventually be reused in a new data table.

I have a small amount of experience with JS, so my idea was to use Puppeteer. My question is twofold: what format should I scrape the data into so that it's convenient to work with, and how do I write the Puppeteer statements to do this?

This is what I have so far:

import puppeteer from 'puppeteer';

(async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();

  await page.goto('redactedurl.com');
  await page.waitForSelector('#myTable');
  const nicks = await page.$$eval(
    '.table .tr_mytarget .td_tnic',
    allNicks => allNicks.map(td_tnick => td_tnick.textContent)
  );

  console.log(nicks);

  await browser.close();
})();

I don't fully understand how to write the $$eval statement. I'm thinking I will want one array for the header and one for the data, but I'm not sure. What's recommended?

To extract the data from the table in a structured format, you can use the following approach:

Extract the header row from the table and use it to create an array of column names.

Iterate over the rows with the mytarget class and extract the data from each cell. Use the column names to create an object for each row, with the column names as the keys and the cell data as the values.

Push each row object into an array to create a final array of objects that represents the data in the table.

Here is an example of how you could do this:

const puppeteer = require('puppeteer');

async function scrapeTable() {
  // Launch a new browser instance
  const browser = await puppeteer.launch();

  // Create a new page
  const page = await browser.newPage();

  // Navigate to the page with the table
  await page.goto('http://example.com/table-page');

  // Extract the data from the table
  const data = await page.evaluate(() => {
    // Extract the header row
    const headerRow = document.querySelector('.table .tr');
    // trim() strips the whitespace left around header cells that contain a checkbox input
    const columnNames = Array.from(headerRow.querySelectorAll('.td')).map(cell => cell.textContent.trim());

    // Extract the data rows
    const dataRows = document.querySelectorAll('.table .tr.mytarget');
    const data = [];
    for (const row of dataRows) {
      // Extract the data from each cell
      const cells = row.querySelectorAll('.td');
      const rowData = {};
      for (let i = 0; i < cells.length; i++) {
        rowData[columnNames[i]] = cells[i].textContent.trim();
      }
      data.push(rowData);
    }
    return data;
  });

  console.log(data);

  // Close the browser
  await browser.close();
}

scrapeTable();

This code extracts the data from the table and creates an array of objects, one per row. Each object has the column names as its keys and the cell text as its values.
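
Since the stated goal is to run calculations and filters on the scraped rows, here is a minimal sketch of what that could look like once the array of objects exists. The sample rows are copied from the question's HTML with only a few columns kept; the helper name `toTypedRow` and the choice of fields are assumptions for illustration, not part of the scraper.

```javascript
// Sample rows shaped like the scraper's output (values copied from the
// question's HTML; most columns omitted for brevity).
const scraped = [
  {Nickname: "Player 1", City: "Cleveland", Score: "21", Version: "7.11"},
  {Nickname: "PlayerB", City: "Mechanicsville", Score: "67991", Version: "7.12"},
  {Nickname: "McChicken", City: "", Score: "0", Version: "7.12"},
];

// Hypothetical helper: every scraped value is a string, so convert the
// numeric column before doing arithmetic on it.
const toTypedRow = row => ({...row, Score: Number(row.Score)});

const rows = scraped.map(toTypedRow);

// Filtering: keep only the players on version 7.12.
const on712 = rows.filter(row => row.Version === "7.12");

// Calculation: total score across all rows.
const totalScore = rows.reduce((sum, row) => sum + row.Score, 0);

console.log(on712.map(row => row.Nickname)); // [ 'PlayerB', 'McChicken' ]
console.log(totalScore); // 68012
```

Keeping the conversion in one small function means the scraping code can stay string-only, and any column that needs a number or date can be handled in a single place.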

I hope this helps!

This looks like a pretty straightforward table traversal, if I understand correctly. The problem is typical: trying to do everything in a single query call when it's better to use two, one for the rows and one for the columns.

Here's an example:

const puppeteer = require("puppeteer"); // ^19.1.0

const html = `your HTML from above`;

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setContent(html);
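  // The class in the markup is lowercase "mytarget"; class selectors are case-sensitive in standards mode
  const data = await page.$$eval("#myTable .tr.mytarget", rows =>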
    rows.map(row =>
      [...row.querySelectorAll(".td")].map(cell => cell.textContent)
    )
  );
  console.log(data);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

This gives a 2D array of the table. If you want an array of objects keyed by field, you can scrape the header row, then glue it to each row of data in the array:

// ...
const headers = await page.$$eval("#myTable .tr:first-child .td", cells => 
  cells.map(e => e.textContent.trim())
);
const withHeaders = data.map(e =>
  Object.fromEntries(headers.map((h, i) => [h, e[i]]))
);
console.log(withHeaders);
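
Because the data is eventually meant to be reused in a new table, it may help to see the round trip back from objects to a header row plus a 2D array. The following is a rough sketch under that assumption; the sample objects use values from the question's HTML with only three columns, and the name `tableRows` is made up for illustration.

```javascript
// Objects shaped like `withHeaders` above (values from the question's HTML;
// only three columns kept for brevity).
const withHeaders = [
  {Nickname: "Player 1", Score: "21", Version: "7.11"},
  {Nickname: "PlayerB", Score: "67991", Version: "7.12"},
  {Nickname: "McChicken", Score: "0", Version: "7.12"},
];

// Sort a copy by numeric score, highest first. The scraped values are
// strings, so compare them as numbers rather than lexically.
const sorted = [...withHeaders].sort(
  (a, b) => Number(b.Score) - Number(a.Score)
);

// Rebuild a header row and a 2D array of values, e.g. for rendering a new table.
const headers = Object.keys(sorted[0]);
const tableRows = sorted.map(row => headers.map(h => row[h]));

console.log(headers); // [ 'Nickname', 'Score', 'Version' ]
console.log(tableRows[0]); // [ 'PlayerB', '67991', '7.12' ]
```

Objects are convenient while filtering and sorting by field name; the 2D form is convenient when writing cells back out in column order.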

