
How To Scrape Hidden Web Data Content With Nodejs Axios Cheerio

If I execute this Node.js code:

const axios = require('axios');
const cheerio = require('cheerio');

axios.get("https://www.efarma.com/rinazina-spray-nasale-decongestionante-nafazolina-0-1-15-ml.html", {
  headers: { "Accept-Encoding": "gzip,deflate,compress" }
})

  .then(({ data }) => {
    const $ = cheerio.load(data);

    console.log($('#product-attribute-specs-table').html());

  });

I get this output:

 <style> .additional-attributes.list>.attribute-1.skeleton { width: 122px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-1"></li>  <style> .additional-attributes.list>.attribute-2.skeleton { width: 145px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-2"></li>  <style> .additional-attributes.list>.attribute-3.skeleton { width: 128px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-3"></li>  <style> .additional-attributes.list>.attribute-4.skeleton { width: 120px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-4"></li>  <style> .additional-attributes.list>.attribute-5.skeleton { width: 129px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-5"></li>  <style> .additional-attributes.list>.attribute-6.skeleton { width: 145px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-6"></li>  <style> .additional-attributes.list>.attribute-7.skeleton { width: 135px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-7"></li> 

But I expect to get this result:

<ul class="data list additional-attributes ml-0 p-0" id="product-attribute-specs-table">   <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Codice:</span>  <span class="data" data-th="Codice">000590051</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Marchio:</span>  <a href="https://www.efarma.com/rinazina.html" class="text-underline link-secondary"> <span class="data" data-th="Marchio">Rinazina</span> </a> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Produttore:</span>  <span class="data" data-th="Produttore">Glaxosmithkline C.Health Srl</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Formato:</span>  <span class="data" data-th="Formato">Spray</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Formulazione:</span>  <span class="data" data-th="Formulazione">Soluzione</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Capacità:</span>  <span class="data" data-th="Capacità">0 - 50 ml</span> </li>  </ul>

Do you know where this data comes from?

To scrape hidden web data content with Node.js, you can use the Axios library to send HTTP requests and the Cheerio library to parse and manipulate the HTML or XML response.

Here's an example of how you can use these libraries to scrape data from a web page:

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeData() {
  try {
    // send an HTTP GET request to the web page
    const response = await axios.get('https://example.com/page');

    // parse the response as HTML
    const $ = cheerio.load(response.data);

    // use cheerio to find the data you want to scrape
    const data = $('#hidden-data').text();
    console.log(data);
  } catch (error) {
    console.error(error);
  }
}

scrapeData();

In this example, Axios sends an HTTP GET request to the web page, and Cheerio parses the HTML response and finds the element with the ID hidden-data. The text content of that element is then logged to the console.

You can use similar techniques to scrape other types of data, such as data from APIs or from XML documents. Axios can also submit form data, and Cheerio can extract the links you want to follow with further requests.

For more information, you can read the documentation for Axios and Cheerio:

Axios: https://github.com/axios/axios
Cheerio: https://cheerio.js.org/

Using Puppeteer makes life easier.
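Why a browser helps here: Cheerio only parses the HTML the server returns and does not run the page's scripts. The `skeleton` items in the question's output look like placeholders that the page fills in later with JavaScript (that is an assumption based on the class name, not something confirmed by the site). A minimal sketch of a check for that situation, using only plain JavaScript; the function name `hasSkeletonPlaceholders` is made up for illustration:

```javascript
// Returns true when an HTML fragment still contains placeholder elements.
// Assumption: placeholders carry a `skeleton` class, as in the question's output.
function hasSkeletonPlaceholders(html) {
    const classAttr = /class="([^"]*)"/g;
    // Look at every class attribute and check for an exact `skeleton` token.
    for (const [, value] of String(html ?? '').matchAll(classAttr)) {
        if (value.split(/\s+/).includes('skeleton')) {
            return true;
        }
    }
    return false;
}
```

In the question's code you could pass it `$('#product-attribute-specs-table').html()`. If it returns true, the real values are not in the response Axios received, so either the page has to be rendered in a browser or the request the page makes for the data has to be found in the browser's network tab.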

This tutorial is a helpful introduction: Introduction to Puppeteer

You can get the list with this code.

const puppeteer = require('puppeteer');
const { highlight } = require('pretty-html-log');

async function getList() {
    const browser = await puppeteer.launch({ headless: true });
    try {
        const page = await browser.newPage();
        // 0 disables the navigation timeout
        page.setDefaultNavigationTimeout(0);
        await page.goto('https://www.efarma.com/rinazina-spray-nasale-decongestionante-nafazolina-0-1-15-ml.html');

        // extract the list from the web page by id
        return await page.evaluate(() => {
            return document.querySelector('#product-attribute-specs-table').outerHTML.trim();
        });
    } finally {
        // close the browser even if navigation or extraction fails
        await browser.close();
    }
}

getList()
    .then((result) => console.log(highlight(result)))
    .catch((error) => console.error(error));
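If you want the values rather than the raw HTML, something like the following might work. It is only a sketch: the helper name `getAttributes` is made up, and it assumes the rendered items no longer carry the `skeleton` class and that each item keeps the `.label` and `.data` spans shown in the expected output. It takes a Puppeteer `page` that has already navigated to the product URL.

```javascript
// Sketch: wait for the real list items, then read them as label/value pairs.
// `page` is a Puppeteer Page that has already loaded the product URL.
async function getAttributes(page) {
    // Assumption: rendered items drop the `skeleton` class used by the placeholders.
    const selector = '#product-attribute-specs-table li:not(.skeleton)';
    await page.waitForSelector(selector, { timeout: 30000 });

    // Runs in the browser: turn each <li> into a [label, value] pair.
    const pairs = await page.$$eval(selector, (items) =>
        items.map((li) => [
            // Strip the trailing colon from labels such as "Codice:".
            li.querySelector('.label')?.textContent.trim().replace(/:\s*$/, '') ?? '',
            li.querySelector('.data')?.textContent.trim() ?? '',
        ])
    );
    return Object.fromEntries(pairs);
}
```

Inside `getList()` you would call `await getAttributes(page)` after `page.goto(...)` instead of the `page.evaluate` step, and log the returned object with `console.log` rather than `highlight`.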

Result

$ node get-data.js
<ul
  class="data list additional-attributes ml-0 p-0"
  id="product-attribute-specs-table"
>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Codice:</span>
    <span class="data" data-th="Codice">000590051</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Marchio:</span>
    <a
      href="https://www.efarma.com/rinazina.html"
      class="text-underline link-secondary"
    >
      <span class="data" data-th="Marchio">Rinazina</span>
    </a>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Produttore:</span>
    <span class="data" data-th="Produttore">Glaxosmithkline C.Health Srl</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Formato:</span>
    <span class="data" data-th="Formato">Spray</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Formulazione:</span>
    <span class="data" data-th="Formulazione">Soluzione</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Capacità:</span>
    <span class="data" data-th="Capacità">0 - 50 ml</span>
  </li>
</ul>

