Using Puppeteer, how would you scrape a website for titles and images and have them be in the same object so that the image is related to the title?

I am able to get the image src and the title in separate variables with this code:

const puppeteer = require("puppeteer");

(async () => {
  let theOfficeUrl =
    "https://www.cardboardconnection.com/funko-pop-the-office-vinyl-figures";

  let browser = await puppeteer.launch({
    headless: true,
    defaultViewport: null,
  });
  let page = await browser.newPage();

  await page.goto(theOfficeUrl, { waitUntil: "networkidle2" });

  let data = await page.evaluate(() => {
    var image = Array.from(
      document.querySelectorAll("div.post_anchor_divs.gallery img")
    ).map((image) => image.src);

    // gives us an array of all h3 titles on the page
    var title = Array.from(document.querySelectorAll("h3")).map(
      (title) => title.innerText
    );
    let forDeletion = ["", "Leave a Comment:"];
    title = title.filter((item) => !forDeletion.includes(item));

    return {
      image,
      title,
    };
  });
  console.log("Running Scraper...");
  console.log({ data });
  console.log("======================");
})();

which yields results like this:

{
  data: {
    image: [Array of image srcs],
    title: [Array of title text]
  }
}

But I need them to be an array of objects that have the corresponding titles and image srcs, like this:

{
  data: [
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    ....so on
  ]
}


The problem I am running into is that the website does not put each image and title in a separate div. They are all in one container div: the titles are in h3 tags with no class names, and the images are in p tags and sometimes in h3 tags as well. The website I am trying to scrape:

https://www.cardboardconnection.com/funko-pop-yu-gi-oh-vinyl-figures

I am trying to scrape the "Funko Pop Yu-Gi-Oh! Figures Gallery" portion, where each Funko Pop's name appears with its image beneath it.

Any pointers on this?

Once you have got the individual arrays in the data object, you can create the desired array like so:

const data = {
    image: ["image1 src", "image2 src", "image3 src", "image4 src"],
    title: ["title1", "title2", "title3", "title4"]
};

// Pair each image with the title at the same index.
const data_new = [];
for (let i = 0; i < data.image.length; i++) {
  data_new.push({ image: data.image[i], title: data.title[i] });
}

This should give you:

data_new = [
    {
        "image": "image1 src",
        "title": "title1"
    },
    {
        "image": "image2 src",
        "title": "title2"
    },
    {
        "image": "image3 src",
        "title": "title3"
    },
    {
        "image": "image4 src",
        "title": "title4"
    }
]
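
Note that pairing by index only gives correct results if the two arrays line up one-to-one. Here is a minimal sketch of the same idea using map, with a length check so that a mismatch fails loudly instead of silently pairing the wrong items. zipTitlesAndImages is a hypothetical helper name for illustration, not part of Puppeteer:

```javascript
// Hypothetical helper: pairs titles and images by index.
// Throws if the arrays differ in length, since index pairing
// would then attach images to the wrong titles.
function zipTitlesAndImages({ title, image }) {
  if (title.length !== image.length) {
    throw new Error(
      `Cannot pair by index: ${title.length} titles vs ${image.length} images`
    );
  }
  return title.map((t, i) => ({ title: t, image: image[i] }));
}

const sample = {
  image: ["image1 src", "image2 src"],
  title: ["title1", "title2"],
};
console.log(zipTitlesAndImages(sample));
```

If the check throws on the real page, the selectors are probably picking up extra headings or images, and the pairing has to be done inside the page instead.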

You can try something like this (as the images are lazy-loaded, the data-src attribute is more appropriate here):

import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();

try {
  const [page] = await browser.pages();

  await page.goto('https://www.cardboardconnection.com/funko-pop-the-office-vinyl-figures');

  const data = await page.evaluate(() => {
    const titles = Array.from(
      document.querySelectorAll("div.post_anchor_divs.gallery h3")
    ).filter(title => title.innerText !== '');

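    return titles.map(title => ({
      title: title.innerText,
      // Assumes the image container is two nodes after the h3
      // (a whitespace text node, then the element holding the img).
      image: title.nextSibling.nextSibling.querySelector('img').dataset.src,
    }));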
  });
  console.log(data);
} catch (err) {
  console.error(err);
} finally {
  await browser.close();
}
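
The nextSibling.nextSibling lookup assumes the image always sits exactly two nodes after its heading. Since the question mentions images in p tags and sometimes inside h3 tags, a more tolerant option may be to walk headings and images in document order and attach each image to the most recent heading. Below is a minimal sketch on plain objects so it runs without a browser. pairInDocumentOrder and the { tag, text, src, dataSrc } record shape are assumptions made up for illustration:

```javascript
// Hypothetical helper: `nodes` is a flat, document-ordered list of
// { tag: "h3", text } and { tag: "img", src, dataSrc } records.
function pairInDocumentOrder(nodes) {
  const items = [];
  let current = null; // most recent non-empty heading

  for (const node of nodes) {
    if (node.tag === "h3") {
      const text = (node.text || "").trim();
      if (text === "") continue; // ignore empty headings
      current = { title: text, image: null };
      items.push(current);
    } else if (node.tag === "img" && current && current.image === null) {
      // Prefer data-src for lazy-loaded images, fall back to src.
      current.image = node.dataSrc || node.src || null;
    }
    // Images before the first heading, and extra images after a
    // heading already has one, are skipped.
  }
  return items;
}

console.log(
  pairInDocumentOrder([
    { tag: "h3", text: "title1" },
    { tag: "img", src: "placeholder.gif", dataSrc: "image1 src" },
    { tag: "h3", text: "title2" },
    { tag: "img", src: "image2 src" },
  ])
);
```

Inside page.evaluate, the record list could be built from document.querySelectorAll("div.post_anchor_divs.gallery h3, div.post_anchor_divs.gallery img"), which returns elements in document order, mapping each element to a record with its tagName, innerText, src and dataset.src. A heading that ends up with image: null is a sign that the page layout differs from what this sketch expects.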
