Scrape multiple websites using Puppeteer
I am trying to scrape just two elements, but from more than one website (in this case, the PS Store). I would also like to do it in the simplest way possible. I'm a rookie in JS, so please be gentle ;) My script is below. I tried to make it work with a for loop, but with no effect (it still only got the first website from the array). Thanks a lot for any kind of help.
const puppeteer = require("puppeteer");

async function scrapeProduct(url) {
  const urls = [
    "https://store.playstation.com/pl-pl/product/EP9000-CUSA09176_00-DAYSGONECOMPLETE",
    "https://store.playstation.com/pl-pl/product/EP9000-CUSA13323_00-GHOSTSHIP0000000",
  ];
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  for (i = 0; i < urls.length; i++) {
    const url = urls[i];
    const promise = page.waitForNavigation({ waitUntil: "networkidle" });
    await page.goto(`${url}`);
    await promise;
  }
  const [el] = await page.$x(
    "/html/body/div[3]/div/div/div[2]/div/div/div[2]/h2"
  );
  const txt = await el.getProperty("textContent");
  const title = await txt.jsonValue();
  const [el2] = await page.$x(
    "/html/body/div[3]/div/div/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/h3"
  );
  const txt2 = await el2.getProperty("textContent");
  const price = await txt2.jsonValue();
  console.log({ title, price });
  browser.close();
}

scrapeProduct();
In general, your code is quite okay. A few things should be corrected, though:
const puppeteer = require("puppeteer");

async function scrapeProduct() {
  const urls = [
    "https://store.playstation.com/pl-pl/product/EP9000-CUSA09176_00-DAYSGONECOMPLETE",
    "https://store.playstation.com/pl-pl/product/EP9000-CUSA13323_00-GHOSTSHIP0000000",
  ];
  // Run with a visible browser window so the navigation can be observed.
  const browser = await puppeteer.launch({
    headless: false
  });
  // `let` keeps the loop counter local instead of creating an implicit global.
  for (let i = 0; i < urls.length; i++) {
    // Open a fresh tab for every URL and do the extraction inside the loop,
    // so each page produces its own result.
    const page = await browser.newPage();
    const url = urls[i];
    const promise = page.waitForNavigation({
      waitUntil: "networkidle2"
    });
    await page.goto(`${url}`);
    await promise;
    // Product title
    const [el] = await page.$x(
      "/html/body/div[3]/div/div/div[2]/div/div/div[2]/h2"
    );
    const txt = await el.getProperty("textContent");
    const title = await txt.jsonValue();
    // Product price
    const [el2] = await page.$x(
      "/html/body/div[3]/div/div/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/h3"
    );
    const txt2 = await el2.getProperty("textContent");
    const price = await txt2.jsonValue();
    console.log({
      title,
      price
    });
  }
  // Wait for the browser to shut down before the function returns.
  await browser.close();
}

scrapeProduct();
Launch the browser with { headless: false }. This allows you to see what actually happens in the browser.
There is no such event as networkidle in the latest version from npm. You should use networkidle0 or networkidle2 instead.
You are using XPath expressions such as html/body/div... to find elements. This might be subjective, but I think standard JS/CSS selectors are more readable: body > div ... But, well, if it works...
The code above yields the following in my case:
{ title: 'Days Gone™', price: '289,00 zl' }
{ title: 'Ghost of Tsushima', price: '289,00 zl' }
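For readers who want to try the CSS-selector suggestion, here is a minimal sketch rather than a definitive implementation. It keeps the same idea as the corrected script (a new page per URL, with the element lookups inside the loop) but reads the text with page.$eval and passes waitUntil directly to page.goto. The function name scrapeProducts, the selectors argument, and the selector strings in the usage comment are assumptions made for illustration; the real PS Store selectors are not given in this thread and would have to be looked up in the browser's developer tools.

```javascript
// Sketch: visit each URL in turn and read two elements via CSS selectors.
// `browser` is expected to behave like a Puppeteer Browser (it needs newPage()).
// `selectors` is a hypothetical { title, price } object of CSS selector strings.
async function scrapeProducts(browser, urls, selectors) {
  const results = [];
  for (const url of urls) {
    const page = await browser.newPage();
    try {
      // goto() accepts waitUntil itself, so no separate waitForNavigation() call is used here.
      await page.goto(url, { waitUntil: "networkidle2" });
      // $eval runs the callback against the first element matching the selector
      // and throws if nothing matches.
      const title = await page.$eval(selectors.title, (el) => el.textContent.trim());
      const price = await page.$eval(selectors.price, (el) => el.textContent.trim());
      results.push({ url, title, price });
    } finally {
      // Close the tab even when a selector is missing, so tabs do not pile up.
      await page.close();
    }
  }
  return results;
}

// Usage (not executed here; the selector strings are placeholders, not the real ones):
// const puppeteer = require("puppeteer");
// (async () => {
//   const browser = await puppeteer.launch();
//   const data = await scrapeProducts(browser, urls, { title: "h2.title", price: "h3.price" });
//   console.log(data);
//   await browser.close();
// })();
```

Taking the browser as a parameter is a design choice of this sketch: it leaves launching and closing the browser to the caller and makes the loop easy to exercise with a stand-in object.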