Scrape multiple websites using Puppeteer

So I am trying to scrape just two elements, but from more than one website (in this case the PS Store). Also, I'm trying to achieve it in the easiest way possible. Since I'm a rookie in JS, please be gentle ;) Below is my script. I was trying to make it happen with a for loop, but with no effect (it still only gets the first website from the array). Thanks a lot for any kind of help.

const puppeteer = require("puppeteer");

async function scrapeProduct(url) {
  const urls = [
    "https://store.playstation.com/pl-pl/product/EP9000-CUSA09176_00-DAYSGONECOMPLETE",
    "https://store.playstation.com/pl-pl/product/EP9000-CUSA13323_00-GHOSTSHIP0000000",
  ];
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  for (i = 0; i < urls.length; i++) {
    const url = urls[i];
    const promise = page.waitForNavigation({ waitUntil: "networkidle" });
    await page.goto(`${url}`);
    await promise;
  }

  const [el] = await page.$x(
    "/html/body/div[3]/div/div/div[2]/div/div/div[2]/h2"
  );
  const txt = await el.getProperty("textContent");
  const title = await txt.jsonValue();

  const [el2] = await page.$x(
    "/html/body/div[3]/div/div/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/h3"
  );
  const txt2 = await el2.getProperty("textContent");
  const price = await txt2.jsonValue();

  console.log({ title, price });

  browser.close();
}

scrapeProduct();

In general, your code is quite okay. A few things should be corrected, though; here is a fixed version, with the specific issues listed after the code:

const puppeteer = require("puppeteer");

async function scrapeProduct() {
    const urls = [
        "https://store.playstation.com/pl-pl/product/EP9000-CUSA09176_00-DAYSGONECOMPLETE",
        "https://store.playstation.com/pl-pl/product/EP9000-CUSA13323_00-GHOSTSHIP0000000",
    ];
    const browser = await puppeteer.launch({
        headless: false
    });
    for (let i = 0; i < urls.length; i++) {
        const page = await browser.newPage();
        const url = urls[i];
        // networkidle2, not "networkidle", which does not exist
        const promise = page.waitForNavigation({
            waitUntil: "networkidle2"
        });
        await page.goto(url);
        await promise;

        // extract title and price inside the loop, once per visited URL
        const [el] = await page.$x(
            "/html/body/div[3]/div/div/div[2]/div/div/div[2]/h2"
        );
        const txt = await el.getProperty("textContent");
        const title = await txt.jsonValue();

        const [el2] = await page.$x(
            "/html/body/div[3]/div/div/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/h3"
        );
        const txt2 = await el2.getProperty("textContent");
        const price = await txt2.jsonValue();

        console.log({
            title,
            price
        });
    }
    await browser.close();
}

scrapeProduct();
  1. You open the webpage in the loop, which is correct, but then you look for the elements outside of the loop. Why? You should do that inside the loop as well.
  2. For debugging, I suggest using { headless: false }. This allows you to see what actually happens in the browser.
  3. Not sure what version of Puppeteer you are using, but there is no such event as networkidle in the latest version from npm. You should use networkidle0 or networkidle2 instead.
  4. You are looking the elements up via the XPath html/body/div... . This might be subjective, but I think standard JS/CSS selectors are more readable: body > div ... (see the sketch right after this list). But, well, if it works...
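To illustrate points 3 and 4 in code, here is a minimal sketch of the same extraction done with page.goto({ waitUntil: "networkidle2" }) and page.$eval plus CSS selectors, instead of a separate waitForNavigation promise and XPath lookups. The selectors are only a mechanical translation of the XPaths from the question, so treat them as placeholders (the real PS Store markup may differ), and here the urls array is passed in as a parameter:

const puppeteer = require("puppeteer");

// Sketch only: the CSS selectors below are a mechanical translation of the
// XPaths from the question and may not match the current PS Store markup.
const TITLE_SELECTOR =
    "body > div:nth-of-type(3) > div > div > div:nth-of-type(2) > div > div > div:nth-of-type(2) > h2";
const PRICE_SELECTOR =
    "body > div:nth-of-type(3) > div > div > div:nth-of-type(2) > div > div > div:nth-of-type(1) > div:nth-of-type(2) > div:nth-of-type(1) > div:nth-of-type(1) > h3";

async function scrapeProduct(urls) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    for (const url of urls) {
        // goto can wait for "network idle" by itself, so no separate
        // waitForNavigation promise is needed
        await page.goto(url, { waitUntil: "networkidle2" });
        const title = await page.$eval(TITLE_SELECTOR, el => el.textContent);
        const price = await page.$eval(PRICE_SELECTOR, el => el.textContent);
        console.log({ title, price });
    }
    await browser.close();
}

Note that page.$eval throws if the selector matches nothing, so in practice you may want to call page.waitForSelector first or wrap the lookups in a try/catch.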

The corrected version of your code yields the following in my case:

{ title: 'Days Gone™', price: '289,00 zl' }
{ title: 'Ghost of Tsushima', price: '289,00 zl' }
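If you later want the function to hand the scraped data back instead of only logging it, one possible variation (again just a sketch, reusing the same XPath lookups as the answer) is to collect each { title, price } pair into an array, close each page after use, and return the array:

const puppeteer = require("puppeteer");

// Sketch: same loop as in the answer, but the results are collected
// and returned instead of being logged one by one.
async function scrapeProducts(urls) {
    const browser = await puppeteer.launch();
    const results = [];
    for (const url of urls) {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: "networkidle2" });

        const [el] = await page.$x(
            "/html/body/div[3]/div/div/div[2]/div/div/div[2]/h2"
        );
        const title = await (await el.getProperty("textContent")).jsonValue();

        const [el2] = await page.$x(
            "/html/body/div[3]/div/div/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/h3"
        );
        const price = await (await el2.getProperty("textContent")).jsonValue();

        results.push({ title, price });
        await page.close();
    }
    await browser.close();
    return results;
}

Calling await scrapeProducts(urls) then resolves to an array with both objects, which is easier to reuse than console output.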
