puppeteer - Exporting a JSON file from a loop

Question

The exported file contains only one URL. The rest of the URLs are missing from the exported file. How can I generate a file with all the entries from the loop?

const puppeteer = require("puppeteer");
const fs = require('fs');

let browser;
(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox']
  });
  const [page] = await browser.pages();

  await page.goto('https://old.reddit.com/', {"waitUntil": "networkidle0"});
  const a_elems = await page.$$('.thumbnail');

  for (var i = 0; i < a_elems.length && i < 3; i++) {
    const elem = a_elems[i];
    const href = await page.evaluate(e => e.href, elem);
    const newPage = await browser.newPage();
    await newPage.goto(href, {"waitUntil": "networkidle0"});

    const url = await newPage.evaluate(() => document.location.href);
    console.log(url);

    fs.writeFileSync('export.json', JSON.stringify(url));
  }

  await browser.close();
})();

Thanks!

Answer 1

Create an array, push each URL onto it in the loop, then move your writeFile call to the end, after the loop.

const puppeteer = require("puppeteer");
const fs = require('fs').promises;

let browser;
(async () => {
  browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox']
  });
  const [page] = await browser.pages();

  await page.goto('https://old.reddit.com/', {
    "waitUntil": "networkidle0"
  });
  const aElems = await page.$$('.thumbnail');
  const urls = [];

  for (let i = 0; i < aElems.length && i < 3; i++) {
    const href = await aElems[i].evaluate(e => e.href);
    const newPage = await browser.newPage();
    await newPage.goto(href, {waitUntil: "networkidle0"});

    const url = await newPage.evaluate(() => document.location.href);
    console.log(url);
    urls.push(url);
  }

  await fs.writeFile('export.json', JSON.stringify(urls));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close())
;
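
To see why the original loop leaves only one URL in export.json: fs.writeFileSync replaces the whole file on every call by default, so each iteration overwrites the previous one. Here is a minimal sketch without Puppeteer, assuming made-up placeholder URLs and a throwaway temp directory (both are illustrative and not from the question):

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical stand-ins for the URLs scraped in the loop.
const sampleUrls = [
  'https://example.com/a',
  'https://example.com/b',
  'https://example.com/c',
];

// Throwaway directory so the sketch does not touch real files.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'export-demo-'));
const overwritten = path.join(dir, 'overwritten.json');
const collected = path.join(dir, 'collected.json');

// Writing inside the loop: each call replaces the previous contents.
for (const url of sampleUrls) {
  fs.writeFileSync(overwritten, JSON.stringify(url));
}

// Collecting first, then writing once after the loop.
const urls = [];
for (const url of sampleUrls) {
  urls.push(url);
}
fs.writeFileSync(collected, JSON.stringify(urls));

const overwrittenResult = JSON.parse(fs.readFileSync(overwritten, 'utf8'));
const collectedResult = JSON.parse(fs.readFileSync(collected, 'utf8'));
console.log(overwrittenResult); // only the last URL survives
console.log(collectedResult);   // all three URLs, as a JSON array
```

The first file ends up holding just the last string, which matches the symptom in the question; the second holds the full array.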

Tips:

You're already in async code, so writeFileSync seems suboptimal here relative to the async version.
Use let instead of var so you don't get bitten by i leaking out of the loop's scope and showing up with a stale value outside (or inside) the loop block.
Consider newPage.close(); at the end of the loop. You're only doing 3 pages now, but if this is temporary and you're going to make it 800, then it's a great idea.
waitUntil: "networkidle0" is really slow. Since all you're doing is accessing document.location.href, you can probably speed things up with waitUntil: "domcontentloaded".
JS uses camelCase, not snake_case.
If you have an ElementHandle, you can just call elementHandle.evaluate(...) rather than page.evaluate(..., elementHandle).
Catch errors with catch and clean up the browser resource with finally.
let browser; was pointless in your original code, because the const browser declared inside the async function shadows it.
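
Two of the tips above (closing each page, and waiting for domcontentloaded instead of networkidle0) could be folded into a helper roughly like the one below. This is only a sketch under some assumptions: collectUrls is a hypothetical name, it takes an already-launched browser plus a list of hrefs, and it reads the final address with page.url() rather than the evaluate call used in the answer. The fake browser at the bottom is a stand-in so the control flow can be run without Chromium.

```javascript
// Hypothetical helper: visits each href in its own tab and returns the final URLs.
async function collectUrls(browser, hrefs) {
  const urls = [];
  for (const href of hrefs) {
    const page = await browser.newPage();
    try {
      // domcontentloaded is enough when only the final URL is needed.
      await page.goto(href, {waitUntil: 'domcontentloaded'});
      urls.push(page.url());
    } finally {
      // Close the tab even if navigation throws, so tabs don't pile up.
      await page.close();
    }
  }
  return urls;
}

// Stand-in for a Puppeteer browser, only to exercise the loop without Chromium.
const closedPages = [];
const fakeBrowser = {
  async newPage() {
    let current = 'about:blank';
    return {
      async goto(href) { current = href; },
      url() { return current; },
      async close() { closedPages.push(current); },
    };
  },
};

const demo = collectUrls(fakeBrowser, [
  'https://example.com/a',
  'https://example.com/b',
]);
demo.then(urls => console.log(urls, closedPages.length));
```

With a real browser from puppeteer.launch(), the call would be await collectUrls(browser, hrefs), followed by a single fs.writeFile of the returned array.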
