
Web scraper in Node.js: JavaScript modifies the DOM

Posted 2018-05-17 09:50:46 · Tags: javascript, html, node.js, parsing, web-scraping

I'm trying to write a web scraper to collect some sales leads. The problem is that in modern web design, most websites use JavaScript to modify the DOM (usually React or Angular, or even just some jQuery). If I download a page with the request Node.js package and pass the HTML to cheerio, I can't parse out the information I want. Instead, all I can see are some React.js components ¯\_(ツ)_/¯ Any resources on this topic would be helpful. Thanks in advance.

1 Answer

This happens because the request package does not execute any of the JavaScript on the page; it just downloads the HTML as is. To see the actual page the way a browser does, you would have to build something that parses and executes all of the page's JavaScript until the DOM reaches the state you want.
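
To make the point concrete, here is a minimal sketch of what a plain HTTP download of a client-rendered page tends to contain. The HTML string, the `root` mount point, and the `visibleText` helper are all hypothetical and only for illustration; a real scraper would use a proper parser such as cheerio instead of regular expressions.

```javascript
// Hypothetical response body for a client-rendered page: the markup is only
// a mount point plus a script tag. The real content is created later, in the
// browser, by the code in bundle.js.
const downloadedHtml = `<!DOCTYPE html>
<html>
  <head><title>Acme</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/bundle.js"></script>
  </body>
</html>`;

// Crude text extraction, for illustration only: drop the <head>, drop
// <script> blocks, then strip every remaining tag and trim whitespace.
function visibleText(html) {
  return html
    .replace(/<head[\s\S]*?<\/head>/gi, '')
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<[^>]+>/g, '')
    .trim();
}

// Nothing was rendered, so there is no text to scrape.
console.log(JSON.stringify(visibleText(downloadedHtml))); // prints ""
```

Without a JavaScript engine running the bundle, the mount point stays empty, which is why the parsed document shows no usable data.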

Luckily, there are some other options here:

You could open the developer tools on the website you want to scrape and look for the XHR requests that fetch the data you need. Then you can call that URL directly.
You could use a headless browser such as PhantomJS or CasperJS. These tools load the page and run its included JavaScript, so the resulting DOM is as close as possible to what a browser would show.
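
The first option could look roughly like the sketch below. It assumes the XHR endpoint returns JSON of the form `{ "items": [{ "company": ..., "email": ... }] }`; that shape, the field names, and the `toLeads` and `fetchLeads` helpers are hypothetical, so adjust them to whatever the network tab actually shows. It also assumes Node.js 18 or later, where `fetch` is available globally.

```javascript
// Convert the (assumed) JSON payload into a flat list of leads.
// Entries without an email address are skipped.
function toLeads(payload) {
  const items = Array.isArray(payload && payload.items) ? payload.items : [];
  return items
    .filter((item) => item && item.email)
    .map((item) => ({ company: item.company || '', email: item.email }));
}

// Call the endpoint directly instead of downloading and parsing the HTML.
// fetchImpl defaults to the global fetch, and can be swapped out in tests.
async function fetchLeads(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url, {
    headers: { Accept: 'application/json' },
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return toLeads(await response.json());
}
```

With a placeholder URL, usage would be something like `fetchLeads('https://example.com/api/companies?page=1').then(console.log)`. Keeping the parsing step in a separate pure function makes it easy to adapt when the endpoint's response format differs from the one assumed here.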