Using Puppeteer, how would you scrape a website for titles and images and have them be in the same object so that the image is related to the title?

I am able to get the image src and the title in separate variables with this code:

const puppeteer = require("puppeteer");

(async () => {
  let theOfficeUrl =
    "https://www.cardboardconnection.com/funko-pop-the-office-vinyl-figures";

  let browser = await puppeteer.launch({
    headless: true,
    defaultViewport: null,
  });
  let page = await browser.newPage();

  // The options object must be passed as the second argument to goto()
  await page.goto(theOfficeUrl, { waitUntil: "networkidle2" });

  let data = await page.evaluate(() => {
    var image = Array.from(
      document.querySelectorAll("div.post_anchor_divs.gallery img")
    ).map((image) => image.src);

    // gives us an array of all h3 titles on the page
    var title = Array.from(document.querySelectorAll("h3")).map(
      (title) => title.innerText
    );
    let forDeletion = ["", "Leave a Comment:"];
    title = title.filter((item) => !forDeletion.includes(item));

    return {
      image,
      title,
    };
  });
  console.log("Running Scraper...");
  console.log({ data });
  console.log("======================");

  // Close the browser so the Node process can exit
  await browser.close();
})();

which yields results like this:

{
  data: {
    image: [Array of image srcs],
    title: [Array of title text]
  }
}

But I need them to be an array of objects that hold the corresponding titles and image srcs, like this:

{
  data: [
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    // ...and so on
  ]
}


The problem I am running into is that the website does not put each image and title in a separate div. They are all in one container div: the titles are in h3 tags with no class names, and the images are in p tags and sometimes in h3 tags as well. The website I am trying to scrape:

https://www.cardboardconnection.com/funko-pop-yu-gi-oh-vinyl-figures

I am trying to scrape the Funko Pop Yu-Gi-Oh! Figures Gallery portion, where each Funko Pop's name appears with its image beneath it.

Any pointers on this?

Once you have the individual arrays in the data object, you can create the desired array like so:

const data = {
    image: ["image1 src", "image2 src", "image3 src", "image4 src"],
    title: ["title1", "title2", "title3", "title4"]
};

// Pair the entries by index; this assumes both arrays are in the same order
const data_new = [];
for (let i = 0; i < data.image.length; i++) {
  data_new.push({ image: data.image[i], title: data.title[i] });
}

This should give you:

data_new = [
    {
        "image": "image1 src",
        "title": "title1"
    },
    {
        "image": "image2 src",
        "title": "title2"
    },
    {
        "image": "image3 src",
        "title": "title3"
    },
    {
        "image": "image4 src",
        "title": "title4"
    }
]
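
The same index-based pairing can also be written with `Array.prototype.map`. The following is only a minimal sketch, assuming the two arrays are aligned index by index; the `zipTitlesAndImages` helper name and the length check are illustrative additions, not part of the approach above.

```javascript
// Hypothetical helper: pairs title[i] with image[i].
// Assumes both arrays are in the same order and have the same length.
function zipTitlesAndImages(data) {
  if (data.image.length !== data.title.length) {
    // With mismatched lengths, index-based pairing would be unreliable.
    throw new Error(
      `Length mismatch: ${data.image.length} images, ${data.title.length} titles`
    );
  }
  return data.title.map((title, i) => ({ title, image: data.image[i] }));
}

const sample = {
  image: ["image1 src", "image2 src"],
  title: ["title1", "title2"],
};
const paired = zipTitlesAndImages(sample);
console.log(paired);
```

Failing loudly on a length mismatch is a design choice: if filtering out empty h3 titles leaves the two arrays out of step, silent mispairing would be hard to spot.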

You can try something like this (as the images are lazy-loaded, the data-src attribute is more appropriate here):

import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();

try {
  const [page] = await browser.pages();

  await page.goto('https://www.cardboardconnection.com/funko-pop-the-office-vinyl-figures');

  const data = await page.evaluate(() => {
    const titles = Array.from(
      document.querySelectorAll("div.post_anchor_divs.gallery h3")
    ).filter(title => title.innerText !== '');

    return titles.map(title => ({
      title: title.innerText,
      image: title.nextSibling.nextSibling.querySelector('img').dataset.src,
    }));
  });
  console.log(data);
} catch(err) { console.error(err); } finally { await browser.close(); }
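
Note that `title.nextSibling.nextSibling` can throw if a title does not have an image exactly two siblings away. As a hedged alternative, the sketch below pairs each title with the first image that follows it in document order, which is the order `querySelectorAll` returns for a combined selector such as `h3, img`. The `pairTitlesWithImages` name and the plain-object stand-ins for DOM elements are assumptions for illustration only, and the page's actual markup may need different handling.

```javascript
// Hypothetical helper: walks elements in document order and pairs each
// non-empty h3 title with the first img that follows it.
// Each element only needs tagName, innerText, src and dataset, so plain
// objects can stand in for DOM nodes when trying it outside the browser.
function pairTitlesWithImages(elements) {
  const items = [];
  let pendingTitle = null;

  for (const el of elements) {
    if (el.tagName === "H3") {
      const text = (el.innerText || "").trim();
      // Skip empty headings, as the filter in the answer above does.
      pendingTitle = text === "" ? null : text;
    } else if (el.tagName === "IMG" && pendingTitle !== null) {
      // Prefer data-src for lazy-loaded images, fall back to src.
      const image = (el.dataset && el.dataset.src) || el.src;
      items.push({ title: pendingTitle, image });
      pendingTitle = null; // each title takes only one image
    }
  }
  return items;
}

// Stand-in for the element list a combined "h3, img" selector would return.
const fakeElements = [
  { tagName: "H3", innerText: "" },
  { tagName: "H3", innerText: "title1" },
  { tagName: "IMG", src: "placeholder.gif", dataset: { src: "image1 src" } },
  { tagName: "IMG", src: "extra src", dataset: {} },
  { tagName: "H3", innerText: "title2" },
  { tagName: "IMG", src: "image2 src", dataset: {} },
];
const pairs = pairTitlesWithImages(fakeElements);
console.log(pairs);
```

In a real run the function would have to be defined inside the `page.evaluate` callback, since code there executes in the page, and it would be called with something like `Array.from(document.querySelectorAll("div.post_anchor_divs.gallery h3, div.post_anchor_divs.gallery img"))`. Because an img nested inside an h3 still comes after that h3 in document order, this also covers the case where the image sits in the heading itself.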
