
Using Puppeteer to Scrape a Header Row and a Table of Values

I would like to scrape the following data.table from a website:

<body style="background-color:grey;">
  <div class="table" id="myTable" style="display: table;">
    <div class="tr" style="background-color: #4CAF50; color: white;">
      <div class="td tnic">Nickname</div>
      <div class="td tsrv">Server IP</div>
      <div class="td tip">IP</div>
      <div class="td treg">Region</div>
      <div class="td tcou">Country</div>
      <div class="td tcit">City</div>
      <div class="td tscr">Score <input type="checkbox" onchange="mysrt(this)" id="chkscr"></div>
      <div class="td tupd">Update Time <input type="checkbox" onchange="mysrt(this)" id="chkupd" checked="" disabled="">
      </div>
      <div class="td taut">Auth Key</div>
      <div class="td town">Key Owner</div>
      <div class="td tver">Version</div>
      <div class="td tdet">Details</div>
    </div>
    <div class="tr mytarget ">
      <div class="td tnic">Player 1</div>
      <div class="td tsrv">_GAME_MENU_</div>
      <div class="td tip">x.x.226.35</div>
      <div class="td treg">North America</div>
      <div class="td tcou">United States</div>
      <div class="td tcit">Cleveland</div>
      <div class="td tscr">21</div>
      <div class="td tupd">2022-12-29 10:17:01 (GMT-8)</div>
      <div class="td taut">SecretauthK3y</div>
      <div class="td town">CoolName</div>
      <div class="td tver">7.11</div>
      <div class="td tdet">FPS: 93 @ 0(0) ms @ 0 K/m</div>
    </div>
    <div class="tr mytarget ">
      <div class="td tnic">PlayerB</div>
      <div class="td tsrv">_GAME_MENU_</div>
      <div class="td tip">x.x.90.221</div>
      <div class="td treg">North America</div>
      <div class="td tcou">United States</div>
      <div class="td tcit">Mechanicsville</div>
      <div class="td tscr">67991</div>
      <div class="td tupd">2022-12-29 10:16:56 (GMT-8)</div>
      <div class="td taut">SecretauthK3y2</div>
      <div class="td town">PlayerB</div>
      <div class="td tver">7.12</div>
      <div class="td tdet">FPS: 50 @ 175(243) ms @ 0 K/m</div>
    </div>
    <div class="tr mytarget ">
      <div class="td tnic">McChicken</div>
      <div class="td tsrv">_GAME_MENU_</div>
      <div class="td tip">x.x.39.80</div>
      <div class="td treg">North America</div>
      <div class="td tcou">United States</div>
      <div class="td tcit"></div>
      <div class="td tscr">0</div>
      <div class="td tupd">2022-12-29 09:41:44 (GMT-8)</div>
      <div class="td taut">SecretauthK3y3</div>
      <div class="td town">SOLO KEY</div>
      <div class="td tver">7.12</div>
      <div class="td tdet">FPS: 63 @ 0(0) ms @ 0 K/m</div>
    </div>
  </div>

It has a header row under .tr, and each row of data is a div with the classes .tr.mytarget. Normally there are hundreds more .tr.mytarget rows, all in the same format as the three shown. My goal is to scrape this data in a way that makes it easy to run calculations and filters on it. It will eventually be reused in a new data.table.

I have a small amount of experience with JS, so my idea was to use Puppeteer. My question is twofold: in what format should I scrape the data so that it's convenient to work with, and how do I write the Puppeteer statements to do this?

This is what I have so far:

import puppeteer from 'puppeteer';

(async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();

  await page.goto('redactedurl.com');
  await page.waitForSelector('#myTable');
  const nicks = await page.$$eval(
    '.table .tr_mytarget .td_tnic',
    allNicks => allNicks.map(td_tnick => td_tnick.textContent)
  );

  await console.log(nicks);

I don't fully understand how to write the $$eval statement. I'm thinking I will want one array for the header and one for the data, but I'm not sure. What's recommended?

To extract the data from the table in a structured format, you can use the following approach:

1. Extract the header row from the table and use it to create an array of column names.

2. Iterate over the rows with the mytarget class and extract the data from each cell. Use the column names to create an object for each row, with the column names as the keys and the cell data as the values.

3. Push each row object into an array to create a final array of objects that represents the data in the table.

Here is an example of how you could do this:

const puppeteer = require('puppeteer');

async function scrapeTable() {
  // Launch a new browser instance
  const browser = await puppeteer.launch();

  // Create a new page
  const page = await browser.newPage();

  // Navigate to the page with the table
  await page.goto('http://example.com/table-page');

  // Extract the data from the table
  const data = await page.evaluate(() => {
    // Extract the header row
    const headerRow = document.querySelector('.table .tr');
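    // trim() drops the whitespace left by the checkboxes in the "Score" and "Update Time" headers
    const columnNames = Array.from(headerRow.querySelectorAll('.td')).map(cell => cell.textContent.trim());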

    // Extract the data rows
    const dataRows = document.querySelectorAll('.table .tr.mytarget');
    const data = [];
    for (const row of dataRows) {
      // Extract the data from each cell
      const cells = row.querySelectorAll('.td');
      const rowData = {};
      for (let i = 0; i < cells.length; i++) {
        rowData[columnNames[i]] = cells[i].textContent;
      }
      data.push(rowData);
    }
    return data;
  });

  console.log(data);

  // Close the browser
  await browser.close();
}

scrapeTable();

This code will extract the data from the table and create an array of objects that represent the data in the table. Each object will have the column names as the keys and the cell data as the values.
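Since the goal is to filter and calculate on the result, here is a minimal sketch of what that could look like once the array of objects exists. It runs without a browser: the sample rows are copied from the table in the question (three columns only), and it assumes the keys are the trimmed header texts. The names typed, v712 and totalScore are only illustrative.

```javascript
// Sample rows in the shape the scraper returns (values copied from the question's table).
const data = [
  { Nickname: "Player 1", Score: "21", Version: "7.11" },
  { Nickname: "PlayerB", Score: "67991", Version: "7.12" },
  { Nickname: "McChicken", Score: "0", Version: "7.12" },
];

// textContent always yields strings, so convert numeric columns before doing math.
const typed = data.map(row => ({ ...row, Score: Number(row.Score) }));

// Filter: keep only players on version 7.12.
const v712 = typed.filter(row => row.Version === "7.12");

// Calculation: total score across the filtered rows.
const totalScore = v712.reduce((sum, row) => sum + row.Score, 0);

console.log(v712.map(row => row.Nickname)); // [ 'PlayerB', 'McChicken' ]
console.log(totalScore); // 67991
```

Converting types once, right after scraping, keeps the later filter and reduce steps free of string-to-number surprises such as "21" + "0" giving "210".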

I hope this helps!

This looks like a pretty straightforward table traversal, if I understand correctly. The problem is typical: trying to do everything in a single query call when it's better to use two; one for the rows, one for the columns.

Here's an example:

const puppeteer = require("puppeteer"); // ^19.1.0

const html = `your HTML from above`;

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setContent(html);
  // Class selectors are case-sensitive in standards mode: the class is "mytarget", not "myTarget"
  const data = await page.$$eval("#myTable .tr.mytarget", rows =>
    rows.map(row =>
      [...row.querySelectorAll(".td")].map(cell => cell.textContent)
    )
  );
  console.log(data);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

This gives a 2d array of the table. If you want an array of objects keyed by field, you can scrape the headers row, then glue it to each row of data in the array:

// ...
const headers = await page.$$eval("#myTable .tr:first-child .td", cells => 
  cells.map(e => e.textContent.trim())
);
const withHeaders = data.map(e =>
  Object.fromEntries(headers.map((h, i) => [h, e[i]]))
);
console.log(withHeaders);
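To compare the two formats without launching a browser, the gluing step can be tried on a couple of hard-coded rows. This is only a sketch: the sample values are copied from the question's table (two columns for brevity), and toObjects is a hypothetical helper name wrapping the same Object.fromEntries logic.

```javascript
// Sample values copied from the table in the question (two columns only).
const headers = ["Nickname", "Score"];
const rows = [
  ["Player 1", "21"],
  ["PlayerB", "67991"],
];

// Same gluing step as above, wrapped in a function (hypothetical helper name).
function toObjects(headers, rows) {
  return rows.map(row =>
    Object.fromEntries(headers.map((h, i) => [h, row[i]]))
  );
}

const objects = toObjects(headers, rows);

console.log(rows[1][1]);       // 2D array: access by column index
console.log(objects[1].Score); // array of objects: access by column name
```

Both lines read the same cell. The 2D array is more compact, while the objects keep working if the site reorders its columns, as long as the header text stays the same.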

