How to scrape JavaScript-rendered websites using JavaScript?

I'm trying to scrape the element matched by $('a[href^="mailto:"]') on this website: https://celsius.network/

When I run that selector in the browser console, I get a link, so I know it's there.

The issue is that my request (using the Axios library) returns the HTML as it is before any JavaScript has run. I've set the User-Agent, but that doesn't seem to help.

const axios = require("axios");

const axiosClient = () =>
  axios.create({
    headers: {
      "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4"
    },
    timeout: 10000
  });


axiosClient()
  .get("https://celsius.network")
  .then(({ data }) => {
    console.log("DATAAAAAAAA: ", data);
  })
  .catch((err) => console.error(err));

This returns the original HTML, with the body:

<body>
  <div id="app"> </div>
  ....

instead of the fully loaded version, after all the JavaScript has manipulated the DOM.

P.S. I am doing this through Firebase functions, so I think there are limits to what I can install.

UPDATE

const findEmail = url =>
  new Promise((resolve, reject) => {
     // here!
  });

A plain HTTP request isn't enough to reproduce what you see when visiting the page in a browser, because nothing executes the page's JavaScript. There are several options for that; Puppeteer is a good candidate for the job.
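To make the gap concrete, here is a minimal sketch (not from the original post) that looks for mailto: links in a raw HTML string, which is all a plain HTTP request hands back. The helper name extractMailtoLinks and both sample strings are made up for illustration, and the regex is only a rough stand-in for a real HTML parser.

```javascript
// Hypothetical helper: collect addresses from mailto: links in a raw HTML string.
// A regex is only a rough approximation of real HTML parsing.
function extractMailtoLinks(html) {
  const pattern = /href\s*=\s*["']mailto:([^"'?]+)/gi;
  const addresses = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    addresses.push(match[1]);
  }
  return addresses;
}

// What the plain request returns: an empty application shell.
const staticHtml = '<body><div id="app"> </div></body>';

// Assumed shape of the DOM after the page's scripts have run.
const renderedHtml =
  '<body><div id="app"><a href="mailto:presale@celsius.network">Contact</a></div></body>';

console.log(extractMailtoLinks(staticHtml));   // []
console.log(extractMailtoLinks(renderedHtml)); // [ 'presale@celsius.network' ]
```

The link only exists in the second string, so something has to run the page's scripts before you query for it.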

Most things that you can do manually in the browser can be done using Puppeteer!

Check out the following...

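const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://celsius.network/');
    // Runs inside the page; yields null if no mailto link is present.
    const textContent = await page.evaluate(() => {
      const link = document.querySelector('a[href^="mailto:"]');
      return link ? link.textContent : null;
    });

    console.log(textContent); // presale@celsius.network
  } finally {
    // Close the browser even if navigation or evaluation throws.
    await browser.close();
  }
})().catch(console.error);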
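
One caveat with the snippet above: by default page.goto resolves on the page's load event, which can fire before a client-side app has inserted the link, in which case the query finds nothing. A hedged sketch of one way to handle that is to wait for the selector first. The helper name readMailto, the 10-second default, and the choice to return null on timeout are assumptions for illustration; page.waitForSelector and page.$eval are Puppeteer Page methods.

```javascript
// Hypothetical helper: `page` is expected to be a Puppeteer Page
// (or any object exposing waitForSelector and $eval with the same shape).
async function readMailto(page, timeoutMs = 10000) {
  const selector = 'a[href^="mailto:"]';
  try {
    // Wait until client-side rendering has added the link to the DOM.
    await page.waitForSelector(selector, { timeout: timeoutMs });
  } catch (err) {
    // Assumption: a timeout means the page has no such link.
    if (err && err.name === "TimeoutError") return null;
    throw err;
  }
  // Read the href inside the page, then strip the scheme and any query part.
  const href = await page.$eval(selector, (a) => a.getAttribute("href"));
  return href.replace(/^mailto:/, "").split("?")[0];
}

// Usage with Puppeteer (not executed here):
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   await page.goto('https://celsius.network/');
//   console.log(await readMailto(page));
//   await browser.close();
```

Reading the href rather than textContent is a design choice: the visible link text is not guaranteed to be the address itself.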

I'm not totally clear on your constraints. You wrote:

"there are limits to what I can install"

If you can install axios, I'd assume you can install this npm package as well?


Per your update, Puppeteer can also be driven through its promise API. The following should do it for you...

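// Assumes `puppeteer` has already been required, as in the previous snippet.
const findEmail = url =>
  new Promise((resolve, reject) => {
    puppeteer.launch().then((browser) =>
      browser.newPage()
        .then((page) =>
          // Navigate to the URL passed in, then query the rendered DOM.
          page.goto(url).then(() =>
            page.evaluate(() => {
              const link = document.querySelector('a[href^="mailto:"]');
              return link ? link.textContent : null;
            })
          )
        )
        // Close the browser whether the lookup succeeded or failed.
        .finally(() => browser.close())
        .then(resolve)
    ).catch(reject); // Surface launch, navigation, and evaluation errors.
  });

findEmail('https://celsius.network/')
  .then((email) => {
    console.log(email); // presale@celsius.network
  })
  .catch((err) => console.error(err));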

