How to scrape the src image with puppeteer?

I need the src of the image shown in the popup on this page: https://www.tokopedia.com/pusatvalve/1-2-inch-ball-valve-sankyo-mojekerto

This is what I have tried:

const popup = await page.$('div.css-hnnye.ew904gd0');
    const maxLoop = await page.evaluate(() => {
      let contain = document.querySelectorAll('div.css-1muhp5u.ejaoon00');
      return contain.length;
    });

    let image1 = '';
    let image2 = '';
    let image3 = '';
    let image4 = '';
    let image5 = '';

    if (0 <= Number(maxLoop)) {
      image1 = await popup.evaluate( popup => {
        popup.click()
        let image = document.querySelector('img.css-udmgcf').src;
        return image;
      } );
    }

    await page.keyboard.press('Escape');
    await page.keyboard.up('Escape');
    await page.click('div.css-xwybk > div > div > div:nth-child(2) > div');

    const popup2 = await page.$('div.css-hnnye.ew904gd0');

    if (1 <= Number(maxLoop)) {
      image2 = await popup2.evaluate( popup2 => {
        popup2.click()
        let image = document.querySelector('img.css-udmgcf').src;
        return image;
      } );
    }

    image1 !== '' ? item.image1 = image1 : '';
    image2 !== '' ? item.image2 = image2 : '';
    image3 !== '' ? item.image3 = image3 : '';
    image4 !== '' ? item.image4 = image4 : '';
    image5 !== '' ? item.image5 = image5 : '';

But the result is always the same picture.

Note: I want to get the src in .jpeg format.

You can try doing something like this:

const puppeteer = require('puppeteer')

const PAGE_URL = ' ... ' // the page to scrape the images from

const browser = puppeteer.launch({
    headless: true
});

(async function () {
    const page = await (await browser).newPage()

    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36');
    await page.setViewport({ width: 960, height: 768 });

    await page.goto(PAGE_URL, {
        timeout: 60000
    })

    const scrapedImages = await page.evaluate(async () => {
        const asyncSleep = (ms) => new Promise((rs, _) => setTimeout(rs, ms))

        const images = []

        for (const eachThumbnail of document.querySelectorAll("div[data-testid='PDPImageThumbnail'] > div > img")) {
            await eachThumbnail.click()

            let imageSrc = document.querySelector("div[data-testid='PDPImageMain'] > div > div > img").src

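            // Poll until the main image shows a new, fully loaded URL
            // (data: URIs are only loading placeholders).
            while (images.includes(imageSrc) || imageSrc.startsWith('data:')) {
                // Wait first, then re-read, so the loop condition checks a fresh value.
                await asyncSleep(1000)
                imageSrc = document.querySelector("div[data-testid='PDPImageMain'] > div > div > img").src
            }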

            images.push(imageSrc)
        }

        return images
    })

    console.log(scrapedImages)

    // Close the browser so the Node process can exit.
    await (await browser).close()
})()

Here the script selects elements by their data-testid attribute, which is more stable than div.css-xwybk or similar generated class names (which I assume change frequently).
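To keep those selectors in one place, something like the following might help. It is only a sketch: `byTestId` is a hypothetical helper name (not part of Puppeteer), and the two test IDs are simply the ones the script above already relies on.

```javascript
// Hypothetical helper (not a Puppeteer API): builds a CSS attribute selector
// for a data-testid value, optionally followed by a child path.
function byTestId(testId, childPath = '') {
    // JSON.stringify wraps the value in double quotes for the attribute selector.
    const base = `div[data-testid=${JSON.stringify(testId)}]`
    return childPath ? `${base} ${childPath}` : base
}

// The same two selectors the script uses, built from their test IDs.
const THUMBNAIL_SELECTOR = byTestId('PDPImageThumbnail', '> div > img')
const MAIN_IMAGE_SELECTOR = byTestId('PDPImageMain', '> div > div > img')

console.log(THUMBNAIL_SELECTOR)
console.log(MAIN_IMAGE_SELECTOR)
```

If the site renames a test ID, only the constants need to change.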

The other thing is that the thumbnails aren't full size, so the script clicks each one, waits until the original image is rendered, and then stores its URL. (It also skips base64 data: src values, since these are used to show a loading indicator.)
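The click-then-wait step can be sketched as a small polling helper with an upper bound, so it cannot spin forever if two thumbnails happen to share the same image. This is only an illustration: `waitForNewSrc`, `intervalMs` and `maxTries` are hypothetical names, and `getSrc` stands in for whatever reads the main image's src.

```javascript
// Hypothetical helper: polls getSrc() until it returns a URL that is neither a
// data: placeholder nor already collected, or gives up after maxTries reads.
async function waitForNewSrc(getSrc, seen, { intervalMs = 250, maxTries = 40 } = {}) {
    for (let attempt = 0; attempt < maxTries; attempt++) {
        const src = getSrc()
        if (src && !src.startsWith('data:') && !seen.includes(src)) {
            return src
        }
        await new Promise((resolve) => setTimeout(resolve, intervalMs))
    }
    return null // gave up: the image never changed or never finished loading
}

// Stand-in for the page: the first read is a loading placeholder, the second is real.
const fakeReads = ['data:image/gif;base64,AAAA', 'https://example.com/a.jpeg']
waitForNewSrc(() => fakeReads.shift(), [], { intervalMs: 10 })
    .then((src) => console.log(src))
```

To use it with the script above, the helper would have to be defined inside the page.evaluate callback, since functions declared on the Node side are not visible in the page context.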

NOTE: Before you automate anything against ANY site, please make sure what you're doing isn't forbidden by or against the site's policy. (This answer is strictly to show how puppeteer could be used for this purpose and isn't meant to encourage you to do it.)
