How To Scrape Hidden Web Data Content With Nodejs Axios Cheerio

If I run this Node.js code

const axios = require('axios');
const cheerio = require('cheerio');

axios.get("https://www.efarma.com/rinazina-spray-nasale-decongestionante-nafazolina-0-1-15-ml.html", {
  headers: { "Accept-Encoding": "gzip,deflate,compress" }
})

  .then(({ data }) => {
    const $ = cheerio.load(data);

    console.log($('#product-attribute-specs-table').html());

  });

I get this output

 <style> .additional-attributes.list>.attribute-1.skeleton { width: 122px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-1"></li>  <style> .additional-attributes.list>.attribute-2.skeleton { width: 145px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-2"></li>  <style> .additional-attributes.list>.attribute-3.skeleton { width: 128px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-3"></li>  <style> .additional-attributes.list>.attribute-4.skeleton { width: 120px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-4"></li>  <style> .additional-attributes.list>.attribute-5.skeleton { width: 129px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-5"></li>  <style> .additional-attributes.list>.attribute-6.skeleton { width: 145px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-6"></li>  <style> .additional-attributes.list>.attribute-7.skeleton { width: 135px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-7"></li> 

whereas I expected to get this result

<ul class="data list additional-attributes ml-0 p-0" id="product-attribute-specs-table">   <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Codice:</span>  <span class="data" data-th="Codice">000590051</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Marchio:</span>  <a href="https://www.efarma.com/rinazina.html" class="text-underline link-secondary"> <span class="data" data-th="Marchio">Rinazina</span> </a> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Produttore:</span>  <span class="data" data-th="Produttore">Glaxosmithkline C.Health Srl</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Formato:</span>  <span class="data" data-th="Formato">Spray</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Formulazione:</span>  <span class="data" data-th="Formulazione">Soluzione</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Capacità:</span>  <span class="data" data-th="Capacità">0 - 50 ml</span> </li>  </ul>

Do you know where this data comes from?

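To scrape hidden web data content with Node.js, you can use the Axios library to send HTTP requests and the Cheerio library to parse and manipulate the HTML or XML response.

Here is an example of how to use these libraries to scrape data from a web page: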

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeData() {
  try {
    // send an HTTP GET request to the web page
    const response = await axios.get('https://example.com/page');

    // parse the response as HTML
    const $ = cheerio.load(response.data);

    // use cheerio to find the data you want to scrape
    const data = $('#hidden-data').text();
    console.log(data);
  } catch (error) {
    console.error(error);
  }
}

scrapeData();

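In this example, the Axios library sends an HTTP GET request to the web page, and the Cheerio library parses the HTML response and finds the element with the ID hidden-data. The text content of that element is then logged to the console.

You can use similar techniques to scrape other kinds of data, such as data from an API or an XML document. You can also use Axios and Cheerio to submit form data, follow links, and perform other actions on web pages.

For more information, see the Axios and Cheerio documentation:

Axios: https://github.com/axios/axios
Cheerio: https://cheerio.js.org/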
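One caveat to the Axios and Cheerio approach above: Cheerio only parses the markup it is given and does not execute the page's scripts, so anything the browser fills in after load never shows up in the fetched HTML. The output in the question, where every li carries a skeleton class, looks like that situation. The following is a minimal, dependency-free sketch for checking whether fetched HTML still holds such placeholders. The function name countSkeletonItems and the regex-based matching are illustrative assumptions, not part of either library; with Cheerio already loaded, a selector such as li.skeleton would likely be the more robust check.

```javascript
// Hypothetical helper: counts <li> opening tags whose class list contains "skeleton".
// A simplification for quick diagnosis; it is not a full HTML parser.
function countSkeletonItems(html) {
  const liTags = html.match(/<li\b[^>]*>/gi) || [];
  return liTags.filter((tag) => {
    const classMatch = tag.match(/\bclass="([^"]*)"/i);
    return classMatch !== null && classMatch[1].split(/\s+/).includes('skeleton');
  }).length;
}

// Shortened samples modelled on the two snippets in the question.
const staticHtml =
  '<li class="list-unstyled attribute skeleton attribute-1"></li>' +
  '<li class="list-unstyled attribute skeleton attribute-2"></li>';
const renderedHtml =
  '<li class="list-unstyled attribute"><span class="data" data-th="Formato">Spray</span></li>';

console.log(countSkeletonItems(staticHtml));   // 2
console.log(countSkeletonItems(renderedHtml)); // 0
```

A non-zero count on the raw response suggests the real values are filled in client-side, in which case either a headless browser or the underlying data request is needed.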

Using Puppeteer makes life easier.

This tutorial is a helpful starting point: Introduction to Puppeteer

You can get the list with this code.

const puppeteer = require('puppeteer');
const { highlight } = require('pretty-html-log');

async function getList() {
    const browser = await puppeteer.launch({ headless: true });
    try {
        const page = await browser.newPage();
        // 0 disables the navigation timeout entirely
        page.setDefaultNavigationTimeout(0);
        await page.goto('https://www.efarma.com/rinazina-spray-nasale-decongestionante-nafazolina-0-1-15-ml.html');

        // extract the list from the rendered page by id
        return await page.evaluate(() => {
            return document.querySelector('#product-attribute-specs-table').outerHTML.trim();
        });
    } finally {
        // close the browser even if navigation or extraction fails
        await browser.close();
    }
}

getList()
    .then((result) => console.log(highlight(result)))
    .catch((error) => console.error(error));

Result

$ node get-data.js
<ul
  class="data list additional-attributes ml-0 p-0"
  id="product-attribute-specs-table"
>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Codice:</span>
    <span class="data" data-th="Codice">000590051</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Marchio:</span>
    <a
      href="https://www.efarma.com/rinazina.html"
      class="text-underline link-secondary"
    >
      <span class="data" data-th="Marchio">Rinazina</span>
    </a>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Produttore:</span>
    <span class="data" data-th="Produttore">Glaxosmithkline C.Health Srl</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Formato:</span>
    <span class="data" data-th="Formato">Spray</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Formulazione:</span>
    <span class="data" data-th="Formulazione">Soluzione</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Capacità:</span>
    <span class="data" data-th="Capacità">0 - 50 ml</span>
  </li>
</ul>
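
If the goal is the attribute values rather than the raw markup, the rendered list above can be reduced to a plain object keyed by each span's data-th attribute. This is a minimal sketch under that assumption: extractSpecs and fakeSpan are hypothetical names introduced here, and the stand-in objects only mimic the two DOM members the helper touches so it can be tried without a browser.

```javascript
// Hypothetical helper: maps each element's data-th attribute to its trimmed text.
// It relies only on getAttribute() and textContent, so it can also run inside the page.
function extractSpecs(spans) {
  const specs = {};
  for (const span of spans) {
    const key = span.getAttribute('data-th');
    if (key) {
      specs[key] = span.textContent.trim();
    }
  }
  return specs;
}

// Stand-ins that mimic DOM elements, for trying the helper outside a browser.
const fakeSpan = (key, text) => ({
  getAttribute: (name) => (name === 'data-th' ? key : null),
  textContent: text,
});

const sample = [
  fakeSpan('Codice', '000590051'),
  fakeSpan('Marchio', ' Rinazina '),
];

// Logs an object with the keys Codice and Marchio.
console.log(extractSpecs(sample));
```

With the Puppeteer page from the answer, the same function could be passed to page.$$eval('#product-attribute-specs-table span.data', extractSpecs), which hands it the array of matching elements and returns the resulting object to Node.js.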

