How to Scrape Hidden Web Data Content with Node.js, Axios, and Cheerio
If I run this Node.js code:
const axios = require('axios');
const cheerio = require('cheerio');

axios.get("https://www.efarma.com/rinazina-spray-nasale-decongestionante-nafazolina-0-1-15-ml.html", {
    headers: { "Accept-Encoding": "gzip,deflate,compress" }
})
    .then(({ data }) => {
        const $ = cheerio.load(data);
        console.log($('#product-attribute-specs-table').html());
    });
I get this output:
<style> .additional-attributes.list>.attribute-1.skeleton { width: 122px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-1"></li> <style> .additional-attributes.list>.attribute-2.skeleton { width: 145px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-2"></li> <style> .additional-attributes.list>.attribute-3.skeleton { width: 128px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-3"></li> <style> .additional-attributes.list>.attribute-4.skeleton { width: 120px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-4"></li> <style> .additional-attributes.list>.attribute-5.skeleton { width: 129px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-5"></li> <style> .additional-attributes.list>.attribute-6.skeleton { width: 145px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-6"></li> <style> .additional-attributes.list>.attribute-7.skeleton { width: 135px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-7"></li>
whereas I expected to get this result:
<ul class="data list additional-attributes ml-0 p-0" id="product-attribute-specs-table"> <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Codice:</span> <span class="data" data-th="Codice">000590051</span> </li> <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Marchio:</span> <a href="https://www.efarma.com/rinazina.html" class="text-underline link-secondary"> <span class="data" data-th="Marchio">Rinazina</span> </a> </li> <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Produttore:</span> <span class="data" data-th="Produttore">Glaxosmithkline C.Health Srl</span> </li> <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Formato:</span> <span class="data" data-th="Formato">Spray</span> </li> <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Formulazione:</span> <span class="data" data-th="Formulazione">Soluzione</span> </li> <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Capacità:</span> <span class="data" data-th="Capacità">0 - 50 ml</span> </li> </ul>
Do you know where this data comes from?
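To scrape hidden web data content with Node.js, you can use the Axios library to send HTTP requests and the Cheerio library to parse and manipulate the HTML or XML response.
Here is an example of how to use these libraries to scrape data from a web page: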
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeData() {
    try {
        // send an HTTP GET request to the web page
        const response = await axios.get('https://example.com/page');
        // parse the response as HTML
        const $ = cheerio.load(response.data);
        // use cheerio to find the data you want to scrape
        const data = $('#hidden-data').text();
        console.log(data);
    } catch (error) {
        console.error(error);
    }
}

scrapeData();
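In this example, the Axios library sends an HTTP GET request to the web page, and the Cheerio library parses the HTML response and finds the element whose ID is hidden-data. The text content of that element is then logged to the console.
You can use similar techniques to scrape other kinds of data, such as data from an API or an XML document. You can also use Axios and Cheerio to submit form data, follow links, and perform other operations on web pages.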
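One caveat applies to the approach above. Cheerio parses only the HTML string it is given and does not run the page's scripts, so anything the page fills in with JavaScript after loading will be missing. The output in the question, where every list item carries a `skeleton` class, suggests that is what happens here. The following is a minimal sketch of a check for that case, not a definitive implementation. It assumes the `skeleton` class name marks placeholders, as in the question's output, and `looksLikeSkeleton` is a hypothetical helper name.

```javascript
// Heuristic: does the fetched markup still contain skeleton placeholders?
// Assumption: placeholders carry the class "skeleton", as in the question's output.
function looksLikeSkeleton(html) {
  // $(...).html() returns null when the selector matches nothing
  if (typeof html !== 'string') return false;

  // Inspect every class="..." attribute and look for "skeleton" as a whole class name
  const classAttr = /class\s*=\s*"([^"]*)"/g;
  for (const [, classes] of html.matchAll(classAttr)) {
    if (classes.split(/\s+/).includes('skeleton')) return true;
  }
  return false;
}
```

If the result of `$('#product-attribute-specs-table').html()` makes this return true, the static response probably does not contain the data. With Cheerio already loaded, an equivalent check would be `$('#product-attribute-specs-table .skeleton').length > 0`.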
For more information, see the Axios and Cheerio documentation:
Axios: https://github.com/axios/axios
Cheerio: https://cheerio.js.org/
Using Puppeteer makes life easier.
This tutorial is a helpful starting point: Introduction to Puppeteer
You can get the list with this code.
const puppeteer = require('puppeteer');
const { highlight } = require('pretty-html-log');

async function getList() {
    const browser = await puppeteer.launch({ headless: true });
    try {
        const page = await browser.newPage();
        // a timeout of 0 disables the navigation timeout
        await page.setDefaultNavigationTimeout(0);
        await page.goto('https://www.efarma.com/rinazina-spray-nasale-decongestionante-nafazolina-0-1-15-ml.html');
        // extract the list from web page by id
        return await page.evaluate(() => {
            return document.querySelector('#product-attribute-specs-table').outerHTML.trim();
        });
    } finally {
        // close the browser even if navigation or extraction fails
        await browser.close();
    }
}

getList()
    .then((result) => console.log(highlight(result)))
    .catch(error => console.log(error));
Result
$ node get-data.js
<ul
class="data list additional-attributes ml-0 p-0"
id="product-attribute-specs-table"
>
<li class="list-unstyled attribute text-body font-body-sm">
<span class="label font-weight-bold">Codice:</span>
<span class="data" data-th="Codice">000590051</span>
</li>
<li class="list-unstyled attribute text-body font-body-sm">
<span class="label font-weight-bold">Marchio:</span>
<a
href="https://www.efarma.com/rinazina.html"
class="text-underline link-secondary"
>
<span class="data" data-th="Marchio">Rinazina</span>
</a>
</li>
<li class="list-unstyled attribute text-body font-body-sm">
<span class="label font-weight-bold">Produttore:</span>
<span class="data" data-th="Produttore">Glaxosmithkline C.Health Srl</span>
</li>
<li class="list-unstyled attribute text-body font-body-sm">
<span class="label font-weight-bold">Formato:</span>
<span class="data" data-th="Formato">Spray</span>
</li>
<li class="list-unstyled attribute text-body font-body-sm">
<span class="label font-weight-bold">Formulazione:</span>
<span class="data" data-th="Formulazione">Soluzione</span>
</li>
<li class="list-unstyled attribute text-body font-body-sm">
<span class="label font-weight-bold">Capacità:</span>
<span class="data" data-th="Capacità">0 - 50 ml</span>
</li>
</ul>
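If the goal is the attribute values rather than the raw markup, the list can be reduced to a plain object. The sketch below assumes the structure shown in the result above, where each value sits in a `span.data` element whose `data-th` attribute holds the label. `readSpecs` is a hypothetical helper name, not part of Puppeteer.

```javascript
// Turn the attribute list into a plain { label: value } object.
// `root` defaults to the global `document`, so the function can be handed to
// Puppeteer's page.evaluate(), which runs it inside the page.
function readSpecs(root = document) {
  const cells = root.querySelectorAll('#product-attribute-specs-table span.data');
  return Object.fromEntries(
    // data-th holds the label (e.g. "Codice"); the element text holds the value
    Array.from(cells, (el) => [el.getAttribute('data-th'), el.textContent.trim()])
  );
}
```

Inside `getList`, this could replace the `outerHTML` extraction with `const specs = await page.evaluate(readSpecs);`. For the page in the question, the resulting object would be expected to have keys such as `Codice` and `Marchio`.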