
How To Scrape Hidden Web Data Content With Nodejs Axios Cheerio

If I run this Node.js code

const axios = require('axios');
const cheerio = require('cheerio');

axios.get("https://www.efarma.com/rinazina-spray-nasale-decongestionante-nafazolina-0-1-15-ml.html", {
  headers: { "Accept-Encoding": "gzip,deflate,compress" }
})

  .then(({ data }) => {
    const $ = cheerio.load(data);

    console.log($('#product-attribute-specs-table').html());

  });

I get this output

 <style> .additional-attributes.list>.attribute-1.skeleton { width: 122px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-1"></li>  <style> .additional-attributes.list>.attribute-2.skeleton { width: 145px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-2"></li>  <style> .additional-attributes.list>.attribute-3.skeleton { width: 128px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-3"></li>  <style> .additional-attributes.list>.attribute-4.skeleton { width: 120px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-4"></li>  <style> .additional-attributes.list>.attribute-5.skeleton { width: 129px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-5"></li>  <style> .additional-attributes.list>.attribute-6.skeleton { width: 145px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-6"></li>  <style> .additional-attributes.list>.attribute-7.skeleton { width: 135px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-7"></li> 

whereas I expected to get this result

<ul class="data list additional-attributes ml-0 p-0" id="product-attribute-specs-table">   <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Codice:</span>  <span class="data" data-th="Codice">000590051</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Marchio:</span>  <a href="https://www.efarma.com/rinazina.html" class="text-underline link-secondary"> <span class="data" data-th="Marchio">Rinazina</span> </a> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Produttore:</span>  <span class="data" data-th="Produttore">Glaxosmithkline C.Health Srl</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Formato:</span>  <span class="data" data-th="Formato">Spray</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Formulazione:</span>  <span class="data" data-th="Formulazione">Soluzione</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Capacità:</span>  <span class="data" data-th="Capacità">0 - 50 ml</span> </li>  </ul>

Do you know where this data comes from?

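To scrape hidden web data content with Node.js, you can use the Axios library to send HTTP requests and the Cheerio library to parse and manipulate the HTML or XML responses.

Here is an example of how to scrape data from a web page with these libraries: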

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeData() {
  try {
    // send an HTTP GET request to the web page
    const response = await axios.get('https://example.com/page');

    // parse the response as HTML
    const $ = cheerio.load(response.data);

    // use cheerio to find the data you want to scrape
    const data = $('#hidden-data').text();
    console.log(data);
  } catch (error) {
    console.error(error);
  }
}

scrapeData();

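In this example, the Axios library sends an HTTP GET request to the web page, and the Cheerio library parses the HTML response and finds the element whose ID is hidden-data. The text content of that element is then logged to the console.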
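One caveat with this approach: Axios only receives the HTML that the server sends, so content that the page's own JavaScript fills in afterwards will most likely be missing, which matches the placeholder output in the question. The following is a minimal sketch of a heuristic check, assuming the placeholders are `<li>` elements carrying the `skeleton` class as in that output. The function names are hypothetical, and the regular expressions are a rough diagnostic rather than a real HTML parser (with Cheerio already loaded, counting `li.skeleton` matches would be the more robust route).

```javascript
// Count <li> tags whose class attribute contains the word "skeleton".
// Assumption: placeholder items look like the ones in the question's output.
function countSkeletonItems(html) {
  const matches = html.match(/<li\b[^>]*\bclass="[^"]*\bskeleton\b[^"]*"/g);
  return matches ? matches.length : 0;
}

// Count all <li> opening tags in the fragment.
function countListItems(html) {
  const matches = html.match(/<li\b/g);
  return matches ? matches.length : 0;
}

// True when the fragment has list items and every one is a placeholder,
// which suggests the real data is rendered client-side.
function looksUnrendered(html) {
  const total = countListItems(html);
  return total > 0 && countSkeletonItems(html) === total;
}
```

If `looksUnrendered($('#product-attribute-specs-table').html())` returns true, the static response does not contain the data, and a tool that executes the page's JavaScript is probably needed.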

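You can use similar techniques to scrape other kinds of data, such as data from an API or an XML document. You can also use Axios and Cheerio to submit form data, follow links, and perform other operations on web pages.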

For more information, you can read the Axios and Cheerio documentation:

Axios: https://github.com/axios/axios
Cheerio: https://cheerio.js.org/

Using Puppeteer makes life easier.

This tutorial is a helpful starting point: Introduction to Puppeteer

You can get the list with this code.

const puppeteer = require('puppeteer');
const { highlight } = require('pretty-html-log');

async function getList() {
    // launch a headless browser so the page's JavaScript actually runs
    const browser = await puppeteer.launch({ headless: true });
    try {
        const page = await browser.newPage();
        // 0 disables the navigation timeout
        page.setDefaultNavigationTimeout(0);
        await page.goto('https://www.efarma.com/rinazina-spray-nasale-decongestionante-nafazolina-0-1-15-ml.html');

        // extract the list from the rendered page by id
        return await page.evaluate(() => {
            return document.querySelector('#product-attribute-specs-table').outerHTML.trim();
        });
    } finally {
        // close the browser even if navigation or extraction fails
        await browser.close();
    }
}

getList()
    .then((result) => console.log(highlight(result)))
    .catch((error) => console.error(error));

Result

$ node get-data.js
<ul
  class="data list additional-attributes ml-0 p-0"
  id="product-attribute-specs-table"
>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Codice:</span>
    <span class="data" data-th="Codice">000590051</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Marchio:</span>
    <a
      href="https://www.efarma.com/rinazina.html"
      class="text-underline link-secondary"
    >
      <span class="data" data-th="Marchio">Rinazina</span>
    </a>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Produttore:</span>
    <span class="data" data-th="Produttore">Glaxosmithkline C.Health Srl</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Formato:</span>
    <span class="data" data-th="Formato">Spray</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Formulazione:</span>
    <span class="data" data-th="Formulazione">Soluzione</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Capacità:</span>
    <span class="data" data-th="Capacità">0 - 50 ml</span>
  </li>
</ul>
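If the goal is the attribute values rather than the raw HTML, the same idea can be taken one step further. The sketch below assumes `page` is a Puppeteer `Page`, that rendered items no longer carry the `skeleton` class (as the two outputs in the question suggest), and that each value sits in a `span.data` element with a `data-th` label, as in the result above. The helper name `getAttributes` is hypothetical.

```javascript
const LIST = '#product-attribute-specs-table';

// `page` is assumed to be a Puppeteer Page; `url` is the product page to load.
async function getAttributes(page, url) {
  await page.goto(url);

  // Wait until at least one list item without the "skeleton" class exists,
  // i.e. the page's JavaScript has replaced the placeholders.
  await page.waitForSelector(`${LIST} li:not(.skeleton)`);

  // Collect [label, value] pairs from the data-th attribute and the text.
  const pairs = await page.$$eval(`${LIST} span.data`, (spans) =>
    spans.map((span) => [span.getAttribute('data-th'), span.textContent.trim()])
  );
  return Object.fromEntries(pairs);
}

// Usage sketch (requires puppeteer; productUrl is the page from the question):
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   console.log(await getAttributes(page, productUrl));
//   await browser.close();
```

For the HTML shown in the result, this should give an object keyed by label, with entries such as `Codice` and `Marchio`. Taking the page as a parameter keeps browser setup and teardown in one place and makes the helper easy to exercise with a stub.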

