简体   繁体   中英

How to scrape Javascript rendered websites using Javascript?

I'm trying to scrape the $('a[href^="mailto:"]') of this website: https://celsius.network/

When I go to the browser console and run that, I get a link so I know it's there.

The issue is that my request (using the Axios library) returns the DOM before javascript is loaded. I've set the User-Agent, but it looks like it's not working.

const axiosClient = () =>
  axios.create({
    headers: {
      "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4"
    },
    timeout: 10000
  });


axiosClient()
  .get("https://celsius.network")
  .then(({ data }) => {
    console.log("DATAAAAAAAA: ", data);
  })

This is returning the original HTML, with the body:

<body>
  <div id="app"> </div>
  ....

instead of the one that's fully loaded after all the javascript has manipulated the DOM.

PS I am doing this through firebase functions, so I think there are limits to what I can install.

UPDATE

const findEmail = url =>
  new Promise((resolve, reject) => {
     // here!
  });

Your request approach isn't enough to emulate what you'd expect while visiting a page in your browser. While there are some choices out there, puppeteer may be a candidate for the job.

Most things that you can do manually in the browser can be done using Puppeteer!

Check out the following...

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://celsius.network/');
  const textContent = await page.evaluate(() => document.querySelector('a[href^="mailto:"]').textContent);

  console.log(textContent); // presale@celsius.network

  browser.close();
})();

I'm not totally clear on your constraints...

there are limits to what I can install

If you have axios, I'd assume you can install this npm package?


Per your update, puppeteer can also be crafted via the promise api. The following should do it for you...

const findEmail = url =>
  new Promise((resolve, reject) => {
    puppeteer.launch().then((browser) => {
      browser.newPage().then((page) => {
        page.goto('https://celsius.network/').then(() => {
          page.evaluate(() => document.querySelector('a[href^="mailto:"]').textContent).then((element) => {
            resolve(element);
            browser.close();
          });
        });
      });
    });
  });

findEmail().then((email) => {
  console.log(email); // presale@celsius.network
});

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM