Puppeteer with lazy loading images
I am trying to scrape listing data from this real estate website (https://www.zillow.com/vancouver-bc/). I can get all the information about the listings on the page, but for the images (image links/src), after the first few the result is garbage. After some research I found that this is caused by lazy loading. I have tried almost every method suggested in other answers, but none seem to work: scrolling to the bottom, scrolling with delays (https://www.npmjs.com/package/puppeteer-autoscroll-down), and zooming the browser out as far as I can to get the images to render. It still doesn't work. I spent hours searching before deciding to post my question and code here.
let cheerio = require('cheerio')
let puppeteer = require('puppeteer-extra')
const pluginStealth = require("puppeteer-extra-plugin-stealth")
puppeteer.use(pluginStealth())
let userAgent = require('random-useragent')
const baseURL = "https://www.zillow.com/vancouver-bc"
let estateData = []
let urlLinks = []
let scrollPageToBottom = require('puppeteer-autoscroll-down')

let getEstateData = async () => {
  estateData = []
  urlLinks = []
  let url
  for (let pgNum = 1; pgNum <= 1; pgNum++) {
    if (pgNum === 1) {
      url = baseURL + "/"
    } else {
      url = baseURL + ("/" + pgNum + "_p")
    }
    urlLinks.push(url)
  }
  await searchWebsite()
  console.log("search over")
  return estateData
  //module.exports = estateData
}

let searchWebsite = async () => {
  await puppeteer
    .launch({headless : false})
    .then(async function (browser) {
      let page = await browser.newPage();
      // await page.setRequestInterception(true)
      //
      // page.on('request', (req) => {
      //   if( req.resourceType() === 'image' || req.resourceType() === 'stylesheet' || req.resourceType() === 'font'){
      //     req.abort()
      //   }
      //   else {
      //     req.continue()
      //   }
      //
      // })
      let html
      await page.setUserAgent(userAgent.getRandom())
      for(let url of urlLinks){
        console.log(url)
        await page.goto(url).then(async function () {
          html = await page.content();
          let obj = await cheerio('.list-card-link.list-card-info', html)
          let imgObj = await cheerio(".list-card-top", html)
          let geoLocation = await cheerio(".photo-cards.photo-cards_wow", html)
          // await page.waitForSelector('img',{
          //   visible: true,
          // })
          // await page.evaluate(() => { window.scrollTo(0, document.body.scrollHeight)})
          const scrollStep = 250 // default
          const scrollDelay = 100 // default
          const lastPosition = await scrollPageToBottom(page, scrollStep, scrollDelay)
          await page.waitFor(2000)
          let num = 0
          console.log(obj.length)
          for (let key in obj) {
            if (obj[key].attribs) {
              try {
                let geoStr = await geoLocation[0].children[0].children[0].children[0].data
                let geoObj = await (JSON.parse(geoStr)["geo"])
                let extractedInfo = {
                  estateName : await obj[key].children[0].children[0].data,
                  estatePrice : await obj[key].children[2].children[0].children[0].data,
                  saleType : await obj[key].children[1].children[0].next.data,
                  estateConfig : {
                    beds : await obj[key].children[2].children[1].children[0].children[0].data,
                    bath : await obj[key].children[2].children[1].children[1].children[0].data,
                    area : await obj[key].children[2].children[1].children[2].children[0].data
                  },
                  estateLocation : {
                    longitude : await geoObj.longitude,
                    latitude : await geoObj.latitude
                  },
                  estateLink : await obj[key].attribs.href,
                  estateCoverImgLink : await imgObj[num++].children[2].children[0].attribs.src
                }
                console.log(extractedInfo.estateName, imgObj[num].children[2].children[0].attribs.src)
                await estateData.push(extractedInfo)
              }
              catch (e) {
                console.log("Estate Skipped - ", obj[key].children[0].children[0].data, obj[key].attribs.href)
                console.log(e)
              }
            }
          }
          console.log(estateData.length)
        });
      }
      //Now read the page
      console.log("total - ", estateData.length)
      await page.close()
      await browser.close()
    })
    .catch(function (err) {
      console.log(err)
    });
}

module.exports.getEstateData = getEstateData
I had a similar issue and found a working answer here. Hopefully this works for you too. The interval was a little slow, so I changed it from 100 to 30.
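The linked answer is not reproduced in this thread, so the following is only a sketch of a typical interval-based scroll helper of the kind described, not the code behind the link. The name `autoScroll`, the 100-pixel step, and the parameter names are assumptions; the 30 ms interval is the value mentioned above. `page` is expected to be a Puppeteer `Page`.

```javascript
// Hypothetical helper: scrolls the page down in small steps so that
// lazy-loaded images get a chance to enter the viewport and load.
// `distance` (pixels per step) and `interval` (ms between steps) are assumed names.
async function autoScroll(page, distance = 100, interval = 30) {
  await page.evaluate(
    ({ distance, interval }) =>
      new Promise((resolve) => {
        let scrolled = 0;
        const timer = setInterval(() => {
          window.scrollBy(0, distance);
          scrolled += distance;
          // Re-read the height on every tick because it can grow as content loads.
          if (scrolled >= document.body.scrollHeight - window.innerHeight) {
            clearInterval(timer);
            resolve();
          }
        }, interval);
      }),
    { distance, interval }
  );
}
```

It would be called as `await autoScroll(page)` after `page.goto(...)` and before reading the image `src` attributes.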
I was able to solve this with a pretty simple implementation using the puppeteer-autoscroll-down library you mentioned. I'm not sure which images you were specifically trying to grab, but this worked for me.
// Set the initial viewport and navigate to the page
await page.setViewport({ width: 1300, height: 1000 });
await page.goto('https://www.zillow.com/vancouver-bc/', { waitUntil: 'load' });

// Scroll to the very top of the page
await page.evaluate(_ => {
  window.scrollTo(0, 0);
});

// Scroll to the bottom of the page with puppeteer-autoscroll-down
await scrollPageToBottom(page);

// Get your image links
let imageLinks = await page.$$eval('.list-card img', imgLinks => {
  return imgLinks.map((i) => i.src);
});
imageLinks was an array with 40 fully formed links; https://photos.zillowstatic.com/p_e/ISz7wlfm278p501000000000.jpg is one example.
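To check that scrolling actually produced real links rather than lazy-loading placeholders, the collected `src` values could be split as sketched below. The helper name `partitionImageLinks` and the rule that a loaded link is an absolute `http(s)` URL are assumptions for illustration; the placeholder value the site really uses may look different.

```javascript
// Hypothetical helper: separates absolute http(s) image URLs from anything else
// (empty strings, data: URIs, or other placeholder values).
function partitionImageLinks(links) {
  const loaded = [];
  const placeholders = [];
  for (const link of links) {
    if (/^https?:\/\//i.test(link)) {
      loaded.push(link);
    } else {
      placeholders.push(link);
    }
  }
  return { loaded, placeholders };
}

// Example with the real link quoted above and one made-up placeholder value.
const result = partitionImageLinks([
  'https://photos.zillowstatic.com/p_e/ISz7wlfm278p501000000000.jpg',
  'data:image/gif;base64,R0lGODlhAQABAAAAACw=',
]);
console.log(result.loaded.length, result.placeholders.length); // prints: 1 1
```

If `placeholders` is not empty after scrolling, a slower scroll or a short wait before reading the `src` values may be worth trying.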
Hope that helps; this was a pretty brutal one for me to solve as well.