
How to scrape hidden web data content with Node.js, Axios, and Cheerio

If I execute this Node.js code

const axios = require('axios');
const cheerio = require('cheerio');

axios.get("https://www.efarma.com/rinazina-spray-nasale-decongestionante-nafazolina-0-1-15-ml.html", {
  headers: { "Accept-Encoding": "gzip,deflate,compress" }
})

  .then(({ data }) => {
    const $ = cheerio.load(data);

    console.log($('#product-attribute-specs-table').html());

  });

I am getting this output

 <style> .additional-attributes.list>.attribute-1.skeleton { width: 122px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-1"></li>  <style> .additional-attributes.list>.attribute-2.skeleton { width: 145px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-2"></li>  <style> .additional-attributes.list>.attribute-3.skeleton { width: 128px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-3"></li>  <style> .additional-attributes.list>.attribute-4.skeleton { width: 120px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-4"></li>  <style> .additional-attributes.list>.attribute-5.skeleton { width: 129px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-5"></li>  <style> .additional-attributes.list>.attribute-6.skeleton { width: 145px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-6"></li>  <style> .additional-attributes.list>.attribute-7.skeleton { width: 135px; }</style><li class="list-unstyled attribute text-body font-body-sm skeleton attribute-7"></li> 

While I am expecting to get this result

<ul class="data list additional-attributes ml-0 p-0" id="product-attribute-specs-table">   <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Codice:</span>  <span class="data" data-th="Codice">000590051</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Marchio:</span>  <a href="https://www.efarma.com/rinazina.html" class="text-underline link-secondary"> <span class="data" data-th="Marchio">Rinazina</span> </a> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Produttore:</span>  <span class="data" data-th="Produttore">Glaxosmithkline C.Health Srl</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Formato:</span>  <span class="data" data-th="Formato">Spray</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Formulazione:</span>  <span class="data" data-th="Formulazione">Soluzione</span> </li>  <li class="list-unstyled attribute text-body font-body-sm"><span class="label font-weight-bold">Capacità:</span>  <span class="data" data-th="Capacità">0 - 50 ml</span> </li>  </ul>

Do you know where that data is coming from?

To scrape hidden web data content with Node.js, you can use the Axios library to send HTTP requests and the Cheerio library to parse and manipulate the HTML or XML response.

Here's an example of how you can use these libraries to scrape data from a web page:

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeData() {
  try {
    // send an HTTP GET request to the web page
    const response = await axios.get('https://example.com/page');

    // parse the response as HTML
    const $ = cheerio.load(response.data);

    // use cheerio to find the data you want to scrape
    const data = $('#hidden-data').text();
    console.log(data);
  } catch (error) {
    console.error(error);
  }
}

scrapeData();

In this example, the Axios library is used to send an HTTP GET request to the web page, and the Cheerio library is used to parse the HTML response and find the element with the ID hidden-data. The text content of this element is then logged to the console.

You can use similar techniques to scrape other types of data, such as data from APIs or from XML documents. You can also use Axios to submit form data and follow links by issuing further requests; Cheerio itself only parses markup and does not execute a page's JavaScript.

For more information, you can read the documentation for Axios and Cheerio:

Axios: https://github.com/axios/axios
Cheerio: https://cheerio.js.org/
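The `skeleton` class in the question's output suggests that the server's HTML contains placeholders that are filled in by JavaScript in the browser, which a static parse would not see. Here is a minimal sketch for checking whether a fetched HTML string is still in that state. It assumes that the `skeleton` class token marks an unrendered placeholder (inferred from the output above, not confirmed by the site), and `hasSkeletonPlaceholders` is a hypothetical helper name. It uses a plain regular expression so it runs without extra packages.

```javascript
// Hypothetical helper: returns true when any class attribute in the HTML
// string contains the exact token "skeleton" (assumed placeholder marker).
function hasSkeletonPlaceholders(html) {
  const classAttr = /class\s*=\s*"([^"]*)"/g;
  for (const match of html.matchAll(classAttr)) {
    if (match[1].split(/\s+/).includes('skeleton')) {
      return true;
    }
  }
  return false;
}

// Fragments shortened from the question's actual and expected output.
const fetched =
  '<li class="list-unstyled attribute text-body font-body-sm skeleton attribute-1"></li>';
const rendered =
  '<li class="list-unstyled attribute text-body font-body-sm">' +
  '<span class="data" data-th="Codice">000590051</span></li>';

console.log(hasSkeletonPlaceholders(fetched));  // true
console.log(hasSkeletonPlaceholders(rendered)); // false
```

If the check returns true for the Axios response, the data is probably rendered client-side and a browser-based tool is likely needed to see it.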

Using Puppeteer makes life easier.

The tutorial "Introduction to Puppeteer" helps you learn the basics.

You can get the data with this code.

const puppeteer = require('puppeteer');
const { highlight } = require('pretty-html-log');

async function getList() {
    const browser = await puppeteer.launch({ headless: true });
    try {
        const page = await browser.newPage();
        // 0 disables the navigation timeout
        page.setDefaultNavigationTimeout(0);
        await page.goto('https://www.efarma.com/rinazina-spray-nasale-decongestionante-nafazolina-0-1-15-ml.html');

        // extract the list from the web page by id
        return await page.evaluate(() => {
            return document.querySelector('#product-attribute-specs-table').outerHTML.trim();
        });
    } finally {
        // close the browser even if navigation or extraction fails
        await browser.close();
    }
}

getList()
    .then((result) => console.log(highlight(result)))
    .catch((error) => console.error(error));

Result

$ node get-data.js
<ul
  class="data list additional-attributes ml-0 p-0"
  id="product-attribute-specs-table"
>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Codice:</span>
    <span class="data" data-th="Codice">000590051</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Marchio:</span>
    <a
      href="https://www.efarma.com/rinazina.html"
      class="text-underline link-secondary"
    >
      <span class="data" data-th="Marchio">Rinazina</span>
    </a>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Produttore:</span>
    <span class="data" data-th="Produttore">Glaxosmithkline C.Health Srl</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Formato:</span>
    <span class="data" data-th="Formato">Spray</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Formulazione:</span>
    <span class="data" data-th="Formulazione">Soluzione</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Capacità:</span>
    <span class="data" data-th="Capacità">0 - 50 ml</span>
  </li>
</ul>
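Once you have the rendered list, you will probably want the values rather than the markup. The following is a minimal sketch that turns the HTML above into a plain object. `parseSpecs` is a hypothetical helper, and the pattern assumes the exact `<span class="data" data-th="...">` markup shown in the result, so it is fragile; querying the elements inside `page.evaluate` or loading the string into Cheerio would be more robust.

```javascript
// Hypothetical helper: maps each data-th label to the span's text content.
// Assumes the markup shape shown in the result above.
function parseSpecs(html) {
  const specs = {};
  const pattern = /<span class="data" data-th="([^"]+)">([^<]*)<\/span>/g;
  for (const [, label, value] of html.matchAll(pattern)) {
    specs[label] = value.trim();
  }
  return specs;
}

// Sample shortened from the result above.
const sample = `
<ul class="data list additional-attributes ml-0 p-0" id="product-attribute-specs-table">
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Codice:</span>
    <span class="data" data-th="Codice">000590051</span>
  </li>
  <li class="list-unstyled attribute text-body font-body-sm">
    <span class="label font-weight-bold">Formato:</span>
    <span class="data" data-th="Formato">Spray</span>
  </li>
</ul>`;

console.log(parseSpecs(sample));
// { Codice: '000590051', Formato: 'Spray' }
```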

