Web-scraping - How to navigate whenever there is an available link with Puppeteer JS

I want to scrape all the data in the main table body at https://data.anbima.com.br/debentures/AGRU12/agenda. Because the table is paginated, I can't do that easily. I came up with the following code, which is not working: I get the error ReferenceError: list is not defined, even though I defined list right before the while loop.

const puppeteer = require('puppeteer');
const fs = require('fs');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(`https://data.anbima.com.br/debentures/AGRU12/agenda`);
  await page.waitForSelector('.normal-text');
  var list = [];
  while (true) {
    let nextButton;
    await page.evaluate(async () => {
      const nodeList = document.querySelectorAll(
        '.anbima-ui-table > tbody > tr'
      );
      let nodeArray = [...nodeList];
      nextButton = document.querySelector('.anbima-ui-pagination__next-button');

      let listA = nodeArray
        .map((tbody) => [...tbody.children].map((td) => [...td.children]))
        .map((tr) =>
          tr.map((span) =>
            span[0].innerHTML
              .replace('<label class="flag__children">', '')
              .replace('</label>', '')
          )
        );
      list.push(listA);
    });

    if (!nextButton) {
      break;
    } else {
      await page.goto(nextButton.href);
    }
  }

  fs.writeFile('eventDates.json', JSON.stringify(list[0], null, 2), (err) => {
    if (err) throw new Error('Something went wrong');

    console.log('well done you got the dates');
  });
  await browser.close();
})();

list is not defined inside the callback because the function passed to page.evaluate runs in the browser page, not in your Node.js script, so it cannot see variables declared outside it. You need to return the array from page.evaluate and then push that returned array onto list.

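const list = [];
while (true) {
    // The callback runs in the browser, so data can only come back through
    // its return value, and that value must be serializable (no DOM nodes).
    const { rows, nextHref } = await page.evaluate(() => {
        const nodeList = document.querySelectorAll(
            '.anbima-ui-table > tbody > tr'
        );
        const nextButton = document.querySelector('.anbima-ui-pagination__next-button');

        const rows = [...nodeList]
            .map((row) => [...row.children].map((td) => [...td.children]))
            .map((row) =>
                row.map((span) =>
                    span[0].innerHTML
                        .replace('<label class="flag__children">', '')
                        .replace('</label>', '')
                )
            );
        // Assumes, as the question does, that the next button is a link with an href.
        return { rows, nextHref: nextButton ? (nextButton.href || null) : null };
    });
    list.push(...rows);

    // Stop when there is no next page; otherwise follow the link.
    if (!nextHref) break;
    await page.goto(nextHref);
}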
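
To see why the original callback fails, here is a rough Node-only simulation of the idea behind page.evaluate. It is not Puppeteer's actual implementation: evaluateInIsolation is a hypothetical helper that uses Node's built-in vm module to run a function's source text in a fresh context, which is enough to reproduce the ReferenceError and to show that a return value is the way to get data out.

```javascript
const vm = require('vm');

// Hypothetical stand-in for page.evaluate: turn the function into source text
// and run it in a fresh context that shares no variables with this file.
function evaluateInIsolation(fn, ...args) {
  const source = `(${fn.toString()})(...${JSON.stringify(args)})`;
  const result = vm.runInNewContext(source, {});
  // Round-trip through JSON to mimic a serialized return value.
  return result === undefined ? undefined : JSON.parse(JSON.stringify(result));
}

const list = [];

// 1) Touching an outer variable fails: `list` does not exist in the other context.
let errorName = null;
try {
  evaluateInIsolation(() => {
    list.push(['row']);
  });
} catch (err) {
  errorName = err.name;
}
console.log(errorName);

// 2) Returning the data works: push the returned rows on the Node side.
const rows = evaluateInIsolation((prefix) => [[prefix + '1'], [prefix + '2']], 'row');
list.push(...rows);
console.log(list);
```

The second call also shows how arguments get in: they are passed explicitly after the function, much like the extra arguments page.evaluate accepts.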

Edit: Corrected the list.push line in my example.
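
The pagination loop itself can be separated from the browser code, which makes it easier to reason about when it stops. The following is only a sketch under assumptions: collectAllRows and fetchPage are hypothetical names, fetchPage is assumed to resolve to an object of the shape { rows, nextHref } with nextHref set to null on the last page, and the page names and rows in the stub are made up for illustration.

```javascript
// Hypothetical helper: follows next links until there are none left.
// `fetchPage(url)` is assumed to resolve to { rows, nextHref }.
async function collectAllRows(startUrl, fetchPage, maxPages = 100) {
  const allRows = [];
  const visited = new Set();
  let url = startUrl;

  // Stop on a missing link, a repeated URL, or the page cap, so a next
  // button that keeps pointing at the same page cannot loop forever.
  while (url && !visited.has(url) && visited.size < maxPages) {
    visited.add(url);
    const { rows, nextHref } = await fetchPage(url);
    allRows.push(...rows); // spread keeps the result a flat list of rows
    url = nextHref;
  }
  return allRows;
}

// Stub standing in for a Puppeteer-backed fetchPage (made-up pages and data).
const fakePages = {
  'page-1': { rows: [['2021-01-15', 'Juros']], nextHref: 'page-2' },
  'page-2': { rows: [['2021-07-15', 'Juros']], nextHref: null },
};
const fakeFetchPage = async (url) => fakePages[url];

collectAllRows('page-1', fakeFetchPage).then((rows) => console.log(rows));
```

In a real run, fetchPage would call page.goto(url) and then page.evaluate(...) to return the rows together with the next link's href. Because every row ends up in one flat array, the whole array (rather than list[0]) is what should be passed to JSON.stringify.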
