
Scrape multiple websites using Puppeteer

So I am trying to scrape just two elements, but from more than one website (in this case the PS Store). Also, I'm trying to achieve it in the easiest way possible. Since I'm a rookie in JS, please be gentle ;) My script is below. I was trying to make it work with a for loop, but with no effect (it still only got the first website from the array). Thanks a lot for any kind of help.

const puppeteer = require("puppeteer");

async function scrapeProduct(url) {
  const urls = [
    "https://store.playstation.com/pl-pl/product/EP9000-CUSA09176_00-DAYSGONECOMPLETE",
    "https://store.playstation.com/pl-pl/product/EP9000-CUSA13323_00-GHOSTSHIP0000000",
  ];
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  for (i = 0; i < urls.length; i++) {
    const url = urls[i];
    const promise = page.waitForNavigation({ waitUntil: "networkidle" });
    await page.goto(`${url}`);
    await promise;
  }

  const [el] = await page.$x(
    "/html/body/div[3]/div/div/div[2]/div/div/div[2]/h2"
  );
  const txt = await el.getProperty("textContent");
  const title = await txt.jsonValue();

  const [el2] = await page.$x(
    "/html/body/div[3]/div/div/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/h3"
  );
  const txt2 = await el2.getProperty("textContent");
  const price = await txt2.jsonValue();

  console.log({ title, price });

  browser.close();
}

scrapeProduct();

In general, your code is quite okay. A few things should be corrected, though:

const puppeteer = require("puppeteer");

async function scrapeProduct() {
    const urls = [
        "https://store.playstation.com/pl-pl/product/EP9000-CUSA09176_00-DAYSGONECOMPLETE",
        "https://store.playstation.com/pl-pl/product/EP9000-CUSA13323_00-GHOSTSHIP0000000",
    ];
    const browser = await puppeteer.launch({
        headless: false // point 2: watch what the browser does while debugging
    });
    for (let i = 0; i < urls.length; i++) { // declare i with let so it doesn't leak as a global
        const page = await browser.newPage();
        const url = urls[i];
        const promise = page.waitForNavigation({
            waitUntil: "networkidle2" // point 3: "networkidle" is not a valid value
        });
        await page.goto(url);
        await promise;
        // point 1: extract the elements inside the loop, once per page
        const [el] = await page.$x(
            "/html/body/div[3]/div/div/div[2]/div/div/div[2]/h2"
        );
        const txt = await el.getProperty("textContent");
        const title = await txt.jsonValue();

        const [el2] = await page.$x(
            "/html/body/div[3]/div/div/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/h3"
        );
        const txt2 = await el2.getProperty("textContent");
        const price = await txt2.jsonValue();

        console.log({
            title,
            price
        });
    }
    await browser.close();
}

scrapeProduct();
  1. You open the webpage in the loop, which is correct, but then you look for the elements outside of the loop. Why? You should do it within the loop.
  2. For debugging, I suggest using { headless: false }. This allows you to see what actually happens in the browser.
  3. Not sure what version of puppeteer you are using, but there's no such event as networkidle in the latest version from npm. You should use networkidle0 or networkidle2 instead.
  4. You are selecting the elements via the XPath html/body/div.... This might be subjective, but I think standard CSS selectors are more readable: body > div .... But, well, if it works... (see the sketch after this list).
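
For reference, here is a minimal sketch of what point 4 could look like with CSS selectors. The selectors h2.pdp__title and h3.price-display__price are hypothetical placeholders for illustration; inspect the actual PS Store markup and substitute the real ones:

const puppeteer = require("puppeteer");

async function scrapeWithSelectors(urls) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    for (const url of urls) {
        // goto() accepts waitUntil directly, so a separate waitForNavigation is not needed
        await page.goto(url, { waitUntil: "networkidle2" });
        // $eval runs the callback in the page context and returns its result
        const title = await page.$eval("h2.pdp__title", (el) => el.textContent.trim()); // hypothetical selector
        const price = await page.$eval("h3.price-display__price", (el) => el.textContent.trim()); // hypothetical selector
        console.log({ title, price });
    }
    await browser.close();
}

You could call scrapeWithSelectors with the same urls array as above; the behavior is the same as the XPath version, just with (arguably) more readable selectors.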

The code above yields the following in my case:

{ title: 'Days Gone™', price: '289,00 zł' }
{ title: 'Ghost of Tsushima', price: '289,00 zł' }
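
As a side note, since each product page is independent, the sequential loop could also be run concurrently, one tab per URL. This is a sketch of that idea, not part of the original answer; it reuses the same XPath expressions as above:

async function scrapeProductsConcurrently(urls) {
    const browser = await puppeteer.launch();
    const results = await Promise.all(
        urls.map(async (url) => {
            const page = await browser.newPage(); // one tab per URL
            await page.goto(url, { waitUntil: "networkidle2" });
            const [el] = await page.$x("/html/body/div[3]/div/div/div[2]/div/div/div[2]/h2");
            const title = await (await el.getProperty("textContent")).jsonValue();
            const [el2] = await page.$x("/html/body/div[3]/div/div/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/h3");
            const price = await (await el2.getProperty("textContent")).jsonValue();
            await page.close(); // close the tab once its data is extracted
            return { title, price };
        })
    );
    await browser.close();
    return results;
}

With only two URLs this makes little practical difference, but it keeps the total runtime roughly constant as the list of products grows.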
