How to Scrape Hidden Web Data Content with Node.js, Axios, and Cheerio
If I run this Node.js code:
const axios = require('axios');
const cheerio = require('cheerio');

axios.get("https://www.efarma.com/rinazina-spray-nasale-decongestionante-nafazolina-0-1-15-ml.html", {
  headers: { "Accept-Encoding": "gzip,deflate,compress" }
})
  .then(({ data }) => {
    const $ = cheerio.load(data);
    console.log($('#product-attribute-specs-table').html());
  });
I get this output:
<style> .additional-attributes.list>.attribute-1.skeleton { width: 122px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-1"></li> <style> .additional-attributes.list>.attribute-2.skeleton { width: 145px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-2"></li> <style> .additional-attributes.list>.attribute-3.skeleton { width: 128px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-3"></li> <style> .additional-attributes.list>.attribute-4.skeleton { width: 120px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-4"></li> <style> .additional-attributes.list>.attribute-5.skeleton { width: 129px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-5"></li> <style> .additional-attributes.list>.attribute-6.skeleton { width: 145px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-6"></li> <style> .additional-attributes.list>.attribute-7.skeleton { width: 135px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-7"></li>
whereas I was expecting this result:
<ul class="data list additional-attributes ml-0 p-0" id="product-attribute-specs-table"> <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Codice:</span> <span class="data" data-th="Codice">000590051</span> </li> <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Marchio:</span> <a href="https://www.efarma.com/rinazina.html" class="text-underline link-secondary"> <span class="data" data-th="Marchio">Rinazina</span> </a> </li> <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Produttore:</span> <span class="data" data-th="Produttore">Glaxosmithkline C.Health Srl</span> </li> <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Formato:</span> <span class="data" data-th="Formato">Spray</span> </li> <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Formulazione:</span> <span class="data" data-th="Formulazione">Soluzione</span> </li> <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Capacità:</span> <span class="data" data-th="Capacità">0 - 50 ml</span> </li> </ul>
Do you know where this data comes from?
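To scrape hidden web data content with Node.js, you can use the Axios library to send HTTP requests and the Cheerio library to parse and manipulate the HTML or XML response.
Here is an example of how to use these libraries to scrape data from a web page: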
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeData() {
  try {
    // send an HTTP GET request to the web page
    const response = await axios.get('https://example.com/page');
    // parse the response as HTML
    const $ = cheerio.load(response.data);
    // use cheerio to find the data you want to scrape
    const data = $('#hidden-data').text();
    console.log(data);
  } catch (error) {
    console.error(error);
  }
}

scrapeData();
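In this example, Axios sends an HTTP GET request to the web page, and Cheerio parses the HTML response and finds the element whose ID is hidden-data. The text content of that element is then logged to the console.
You can use a similar technique to scrape other kinds of data, such as data from an API or an XML document. Axios can also submit form data, and Cheerio can extract links for follow-up requests.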
For more information, see the Axios and Cheerio documentation:
Axios: https://github.com/axios/axios
Cheerio: https://cheerio.js.org/
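The output in the question suggests why a plain HTTP fetch falls short here: the HTML the server returns appears to contain only empty `skeleton` placeholder items, which are presumably filled in later by JavaScript running in the browser. Cheerio parses markup but does not execute scripts, so it only ever sees the placeholders. As a minimal, dependency-free sketch, the hypothetical helper `looksLikeSkeleton` below (not part of Axios or Cheerio) checks whether every `<li>` in a fragment carries the `skeleton` class. It uses a regular expression only to stay self-contained, and the sample strings are shortened from the markup shown in the question.

```javascript
// Hypothetical helper (not part of Axios or Cheerio): reports whether every
// <li> in an HTML fragment is a "skeleton" placeholder.
function looksLikeSkeleton(html) {
  // Collect the class attribute of every <li ...> opening tag.
  const liTags = [...html.matchAll(/<li\b[^>]*\bclass="([^"]*)"[^>]*>/g)];
  if (liTags.length === 0) return false; // nothing to judge
  return liTags.every(([, classes]) => classes.split(/\s+/).includes('skeleton'));
}

// Shortened samples based on the question's actual and expected output.
const fetched =
  '<li class="list-unstyled attribute text-body font-body-sm skeleton attribute-1"></li>';
const rendered =
  '<li class="list-unstyled attribute text-body font-body-sm">' +
  '<span class="data" data-th="Formato">Spray</span></li>';

console.log(looksLikeSkeleton(fetched));  // true
console.log(looksLikeSkeleton(rendered)); // false
```

With Cheerio already loaded, a comparable check would be `$('#product-attribute-specs-table li.skeleton').length > 0`. If it is true, the real values are not in the fetched HTML and a browser-based tool is one way to get them.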
Using Puppeteer makes life easier.
This tutorial is a helpful primer: Introduction to Puppeteer.
You can get the list with this code:
const puppeteer = require('puppeteer');
const { highlight } = require('pretty-html-log');
async function getList() {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // 0 disables the navigation timeout
    await page.setDefaultNavigationTimeout(0);
    await page.goto('https://www.efarma.com/rinazina-spray-nasale-decongestionante-nafazolina-0-1-15-ml.html');
    // extract the list from the web page by id
    const list = await page.evaluate(() => {
      return document.querySelector('#product-attribute-specs-table').outerHTML.trim();
    });
    return list;
  } finally {
    // close the browser even if navigation or extraction fails
    await browser.close();
  }
}

getList()
  .then((result) => console.log(highlight(result)))
  .catch((error) => console.error(error));
Result:
$ node get-data.js
<ul
class="data list additional-attributes ml-0 p-0"
id="product-attribute-specs-table"
>
<li class="list-unstyled attribute text-body font-body-sm">
<span class="label font-weight-bold">Codice:</span>
<span class="data" data-th="Codice">000590051</span>
</li>
<li class="list-unstyled attribute text-body font-body-sm">
<span class="label font-weight-bold">Marchio:</span>
<a
href="https://www.efarma.com/rinazina.html"
class="text-underline link-secondary"
>
<span class="data" data-th="Marchio">Rinazina</span>
</a>
</li>
<li class="list-unstyled attribute text-body font-body-sm">
<span class="label font-weight-bold">Produttore:</span>
<span class="data" data-th="Produttore">Glaxosmithkline C.Health Srl</span>
</li>
<li class="list-unstyled attribute text-body font-body-sm">
<span class="label font-weight-bold">Formato:</span>
<span class="data" data-th="Formato">Spray</span>
</li>
<li class="list-unstyled attribute text-body font-body-sm">
<span class="label font-weight-bold">Formulazione:</span>
<span class="data" data-th="Formulazione">Soluzione</span>
</li>
<li class="list-unstyled attribute text-body font-body-sm">
<span class="label font-weight-bold">Capacità:</span>
<span class="data" data-th="Capacità">0 - 50 ml</span>
</li>
</ul>
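Because the page seems to start with skeleton placeholders, reading the list immediately after navigation could, in principle, still catch the placeholders. The sketch below is one possible variation, not the answer's method: it waits until a list item without the `skeleton` class exists, then returns the `data-th` labels and their values as a plain object. The selectors are assumptions taken from the markup shown in the question, `getSpecs` and `main` are hypothetical names, and the 30-second timeout is an arbitrary choice.

```javascript
// Assumed selectors, taken from the markup shown in the question.
const TABLE = '#product-attribute-specs-table';
const RENDERED_ITEM = `${TABLE} li:not(.skeleton)`;

// Hypothetical helper: works with a Puppeteer Page (or any object that
// exposes goto, waitForSelector and $$eval in the same way).
async function getSpecs(page, url) {
  await page.goto(url);
  // Wait until at least one non-placeholder item exists (30 s limit).
  await page.waitForSelector(RENDERED_ITEM, { timeout: 30000 });
  // Runs in the browser: pair each data-th label with its text content.
  const pairs = await page.$$eval(`${TABLE} span.data`, (spans) =>
    spans.map((s) => [s.getAttribute('data-th'), s.textContent.trim()])
  );
  return Object.fromEntries(pairs);
}

async function main(url) {
  const puppeteer = require('puppeteer'); // loaded only when actually run
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    console.log(await getSpecs(page, url));
  } finally {
    await browser.close();
  }
}

// Run only when a URL is passed: node get-specs.js <url>
if (process.argv[2]) {
  main(process.argv[2]).catch(console.error);
}
```

Taking the page as a parameter keeps the extraction logic separate from browser setup, so it can be exercised with a stub page object without launching a browser.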