Using Puppeteer, how would you scrape a website for titles and images and have them be in the same object so that the image is related to the title?
I can get the image src and the title into separate variables with this code:
const puppeteer = require("puppeteer");

(async () => {
  const theOfficeUrl =
    "https://www.cardboardconnection.com/funko-pop-the-office-vinyl-figures";
  const browser = await puppeteer.launch({
    headless: true,
    defaultViewport: null,
  });
  const page = await browser.newPage();
  // The options object has to be the second argument of goto()
  await page.goto(theOfficeUrl, { waitUntil: "networkidle2" });
  const data = await page.evaluate(() => {
    // gives us an array of all image srcs in the gallery
    const image = Array.from(
      document.querySelectorAll("div.post_anchor_divs.gallery img")
    ).map((image) => image.src);
    // gives us an array of all h3 titles on page
    let title = Array.from(document.querySelectorAll("h3")).map(
      (title) => title.innerText
    );
    const forDeletion = ["", "Leave a Comment:"];
    title = title.filter((item) => !forDeletion.includes(item));
    return {
      image,
      title,
    };
  });
  console.log("Running Scraper...");
  console.log({ data });
  console.log("======================");
  await browser.close();
})();
which produces a result like this:
{
  data: {
    image: [Array of image srcs],
    title: [Array of title text]
  }
}
But I need them to be an array of objects, each holding a title and its matching image src, like this:
{
  data: [
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    {
      item: {
        title: "title from website",
        image: "image src from website"
      }
    },
    ....so on
  ]
}
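The problem I'm running into is that the website doesn't wrap each image and title in its own div. They are all inside one container div, the titles are in h3 tags with no class name, and the imgs are in p tags, and sometimes in h3 tags as well. The website I'm trying to scrape:
https://www.cardboardconnection.com/funko-pop-yu-gi-oh-vinyl-figures
I'm trying to scrape the Funko Pop Yu-Gi-Oh! Figures Gallery section, which has the name of each Funko Pop with its image below it.
Any pointers on this?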
Once you have the separate arrays in the data object, you can build the desired array like this:
const data = {
  image: ["image1 src", "image2 src", "image3 src", "image4 src"],
  title: ["title1", "title2", "title3", "title4"]
};
const data_new = [];
// Pair the entries by index: image[i] goes with title[i]
for (let i = 0; i < data.image.length; i++) {
  data_new.push({ image: data.image[i], title: data.title[i] });
}
This should give you:
data_new = [
{
"image": "image1 src",
"title": "title1"
},
{
"image": "image2 src",
"title": "title2"
},
{
"image": "image3 src",
"title": "title3"
},
{
"image": "image4 src",
"title": "title4"
}
]
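Pairing by index only works if both arrays have the same length and the same order, which is not guaranteed when the titles and images come from two independent selectors. As a minimal sketch (the helper name `zipTitlesAndImages` is made up for illustration), the same pairing can be written with `map` plus a length check, so a mismatch fails loudly instead of silently producing `undefined` entries:

```javascript
// Hypothetical helper: pairs data.title[i] with data.image[i].
// Throws if the two arrays differ in length, since index-based
// pairing would then attach images to the wrong titles.
function zipTitlesAndImages(data) {
  const { title, image } = data;
  if (title.length !== image.length) {
    throw new Error(
      `Length mismatch: ${title.length} titles vs ${image.length} images`
    );
  }
  return title.map((text, i) => ({ title: text, image: image[i] }));
}

// Sample input in the same shape as the scraper's result
const sample = {
  image: ["image1 src", "image2 src"],
  title: ["title1", "title2"],
};
console.log(zipTitlesAndImages(sample));
```

If the check throws on the real page, the selectors are picking up extra headings or images, and pairing each title with the image next to it in the DOM is probably the safer route.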
You could try something like this (since the images are lazy-loaded, the data-src attribute is more appropriate here):
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();

try {
  const [page] = await browser.pages();
  await page.goto('https://www.cardboardconnection.com/funko-pop-the-office-vinyl-figures');

  const data = await page.evaluate(() => {
    // Only the non-empty h3 headings inside the gallery container are titles
    const titles = Array.from(
      document.querySelectorAll("div.post_anchor_divs.gallery h3")
    ).filter(title => title.innerText !== '');

    return titles.map(title => ({
      title: title.innerText,
      // nextSibling twice: the first sibling is assumed to be a text node,
      // the second the element that holds the <img>
      image: title.nextSibling.nextSibling.querySelector('img').dataset.src,
    }));
  });
  console.log(data);
} catch(err) { console.error(err); } finally { await browser.close(); }
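The `nextSibling.nextSibling` lookup assumes a fixed layout after every heading and will throw if one heading is followed by something else. A more tolerant variant, sketched below, walks forward with `nextElementSibling` until it finds an element that contains an img. The helper name `findImageSrcAfter`, the `maxSteps` limit and the fallback to `src` are my own assumptions, not something taken from the site:

```javascript
// Hypothetical helper: starting at a heading element, look at the following
// sibling elements and return the first image URL found, or null.
function findImageSrcAfter(heading, maxSteps = 5) {
  let node = heading.nextElementSibling;
  for (let step = 0; node && step < maxSteps; step++) {
    // The sibling may be the <img> itself or a wrapper (p, h3) around it
    const img = node.tagName === "IMG" ? node : node.querySelector("img");
    if (img) {
      // Prefer the lazy-load attribute, fall back to the regular src
      return (img.dataset && img.dataset.src) || img.src || null;
    }
    // Reaching the next non-empty title means this heading has no image
    if (node.tagName === "H3" && node.innerText.trim() !== "") {
      return null;
    }
    node = node.nextElementSibling;
  }
  return null;
}
```

Because the `page.evaluate` callback runs in the browser context, the function would have to be declared inside that callback. The mapping then becomes `titles.map(title => ({ title: title.innerText, image: findImageSrcAfter(title) }))`, and entries with a `null` image can be filtered out afterwards.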