Puppeteer - scrape data from table in correct format

I've been working on a Puppeteer app to scrape some data.

I've got this code, which works fine, but I want to improve it so that it returns the data in a structured form I can work with.

const table1 = await page.$$eval('table:nth-child(3) tbody', tbodys => tbodys.map((tbody) => {
  return tbody.innerText;
}));

So tbody lets me scrape all the TR and TD tags, however many there are in the table. The problem is that each table row contains two table cells: the first TD is the header for the data in the second TD.

So I have the following HTML:

<tr class="header1"><th colspan="2">COS-MOD-000-CAB-PAP-123202</th></tr>

body > center > table > tbody > tr:nth-child(2) > td:nth-child(2) > div:nth-child(3) > table:nth-child(3) > tbody > tr:nth-child(2)

//THIS IS THE BODY WHICH MY ORIGINAL CODE IS PULLING OUT THE TEXT OF. MY CODE LOOKS AT TDS ONLY WITHIN TRs.
<tbody><tr class="header1"><th colspan="2">COS-MOD-000-CAB-PAP-123202</th></tr>
<tr class="light">
    <td style="text-align: right; width: 100px;"><strong>Status:</strong></td>//HEADER
    <td valign="top">Wrong&nbsp;</td> //VALUE
</tr>
<tr class="dark">
    <td style="text-align: right; width: 100px;"><strong>Created:</strong></td>//HEADER
    <td valign="top">2019-09-09 17:18:53&nbsp;</td>//VALUE
</tr>
<tr class="light">
    <td style="text-align: right; width: 100px;"><strong>Modified:</strong></td>//HEADER
    <td valign="top">2019-09-09 17:21:19&nbsp;</td>//VALUE
</tr>
<tr class="dark">
    <td style="text-align: right; width: 100px;"><strong>User:</strong></td>//HEADER
    <td valign="top">fbibsan&nbsp;</td>//VALUE
</tr>
<tr class="light">
    <td style="text-align: right; width: 100px;"><strong>BMS Account:</strong></td> //HEADER
    <td valign="top">ABC123 SAS. (SAS)&nbsp;</td> //VALUE
</tr>
<tr class="dark">
    <td style="text-align: right; width: 100px;"><strong>Mode:</strong></td>//HEADER
    <td valign="top">FAF&nbsp;</td>//VALUE
</tr>
<tr class="light">
    <td style="text-align: right; width: 100px;"><strong>Type:</strong></td>
    <td valign="top">BOP&nbsp;</td>
</tr>
</tbody>

The structure I need is for each row in the table:

HEADER:'VALUE'

I hope someone can help. I'd be very grateful, as I've spent days searching.

It depends on what type tbody is inside the map callback. Hopefully you can parse that tbody object somehow.

I think you just need some additional parsing, which probably means adding a bit of logic to your existing function.

Here's what I would do:

const table1 = await page.$$eval('table:nth-child(3) tbody', tbodys => tbodys.map((tbody) => {
  // Inside $$eval, tbody is a DOM element rather than a string, so walk its
  // rows directly instead of matching the markup with a regex.
  let parsedTable = '';
  for (const tr of tbody.rows) {
    const tds = tr.querySelectorAll('td');
    if (tds.length < 2) continue; // skip the <th> title row
    const header = tds[0].innerText.trim().replace(/:$/, ''); // "Status:" -> "Status"
    const value = tds[1].innerText.trim(); // trim() also drops the trailing &nbsp;
    parsedTable += `${header}:'${value}'\n`;
  }
  return parsedTable;
}));
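
The formatting step can also be kept apart from Puppeteer, which makes it easy to check without a browser. The following is only a minimal sketch: it assumes the cell texts have already been pulled out of the page as [header, value] pairs, and formatRows is a hypothetical helper name, not part of Puppeteer.

```javascript
// Hypothetical helper (not a Puppeteer API): turns [header, value] text pairs
// into HEADER:'VALUE' lines, one line per table row.
function formatRows(pairs) {
  return pairs
    .map(([header, value]) => {
      const key = header.trim().replace(/:$/, ''); // "Status:" -> "Status"
      const val = value.trim(); // trim() also removes a trailing non-breaking space
      return `${key}:'${val}'`;
    })
    .join('\n');
}

// Sample pairs copied from the table in the question; \u00a0 stands for &nbsp;.
const sample = [
  ['Status:', 'Wrong\u00a0'],
  ['Created:', '2019-09-09 17:18:53\u00a0'],
];

console.log(formatRows(sample));
```

Returning the raw pairs from the page and formatting them in Node keeps the in-page callback small, at the cost of one extra step.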

If I understand the task correctly, here is a simplified example of how to get structured data from a table:

const html = `
  <!doctype html>
  <html>
    <head><meta charset='UTF-8'><title>Test</title></head>
    <body>
      <table><tbody>
        <tr><th>Header</th><th>Header</th></tr>
        <tr><td>Key 1</td><td>Value 1</td></tr>
        <tr><td>Key 2</td><td>Value 2</td></tr>
      </tbody></table>
    </body>
  </html>`;

const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
    const [page] = await browser.pages();

    await page.goto(`data:text/html,${html}`);

    const data = await page.evaluate(() => {
      const dataObject = {};
      const tbody = document.querySelector('table tbody');

      for (const row of tbody.rows) {
        if (!row.querySelector('td')) continue; // Skip headers.

        const [keyCell, valueCell] = row.cells;
        dataObject[keyCell.innerText] = valueCell.innerText;
      }
      return dataObject;
    });

    console.log(data); // { 'Key 1': 'Value 1', 'Key 2': 'Value 2' }

    // await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
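
Applied to the table in the question, the same idea might look like the sketch below. It is written under a few assumptions: tbodyToObject is a hypothetical name, the trailing colon is stripped from each header, and the function is meant to be handed to page.$eval, so it does not reference any outer variables (Puppeteer serialises the function and runs it inside the page). A plain-object stand-in for the DOM is included so the logic can be tried in Node alone.

```javascript
// Hypothetical callback for page.$eval; it must be self-contained because
// Puppeteer serialises it and evaluates it in the page context.
function tbodyToObject(tbody) {
  const result = {};
  for (const row of tbody.rows) {
    const [keyCell, valueCell] = row.cells;
    // Skip the <th> title row and any row that does not have two <td> cells.
    if (!keyCell || !valueCell || keyCell.tagName !== 'TD') continue;
    const key = keyCell.innerText.trim().replace(/:$/, ''); // "Status:" -> "Status"
    result[key] = valueCell.innerText.trim(); // trim() drops the trailing &nbsp;
  }
  return result;
}

// In Puppeteer (selector taken from the question; $eval uses the first match):
//   const data = await page.$eval('table:nth-child(3) tbody', tbodyToObject);

// Plain-object stand-in for the DOM, mimicking only the properties used above.
const cell = (tagName, innerText) => ({ tagName, innerText });
const fakeTbody = {
  rows: [
    { cells: [cell('TH', 'COS-MOD-000-CAB-PAP-123202')] },
    { cells: [cell('TD', 'Status:'), cell('TD', 'Wrong\u00a0')] },
    { cells: [cell('TD', 'BMS Account:'), cell('TD', 'ABC123 SAS. (SAS)\u00a0')] },
  ],
};

console.log(tbodyToObject(fakeTbody));
```

If several tables match the selector, the same function can be mapped over the elements inside page.$$eval, as in the question's original code.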
