How do I make an HTTP request (for scraping purposes) to an Angular2 site?

I'm trying to use a Node server to scrape some information from an Angular2 application. The problem is that the response I get is the index.js file, essentially the "loading..." part of the page.

I'm using the npm request or request-promise package like this:

var rp = require("request-promise");

rp('https://someurl.com')
    .then((html) => {
        // Do something with the response
    })
    .catch((err) => {
        console.log(err);
    })

But I can't figure out whether it is possible to wait for the page to actually load. I've looked at possibly using Angular Universal, but I need to get the data after it has all loaded, and the site owner is against using Universal.

Is there any way to make this work?

First of all, you need to get a fully rendered page. Unfortunately, JS-rendered web pages can't be scraped without going through the rendering process, but we can run that process with a "headless" browser, such as PhantomJS.

"A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers"
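To make the idea concrete, here is a minimal sketch of fetching the rendered HTML from Node. It rests on some assumptions: it uses Puppeteer (headless Chrome) instead of PhantomJS, because PhantomJS is no longer maintained; the function name fetchRenderedHtml, the options.launch hook, and the optional selector are made up for this example and are not part of any library. Treat it as a starting point rather than a finished scraper.

```javascript
// Sketch only. Assumption: Puppeteer is swapped in for PhantomJS.
// Install with: npm install puppeteer

/**
 * Load a URL in a headless browser and return the HTML after the
 * client-side app has rendered.
 *
 * options.launch   (hypothetical hook) function returning a browser object;
 *                  defaults to starting Puppeteer.
 * options.selector (optional) CSS selector to wait for before reading the HTML.
 */
async function fetchRenderedHtml(url, options = {}) {
    const launch = options.launch || (() => require("puppeteer").launch());
    const browser = await launch();
    try {
        const page = await browser.newPage();
        // "networkidle0" resolves once there have been no network connections
        // for a short period, which gives the app time to bootstrap and
        // fetch its data.
        await page.goto(url, { waitUntil: "networkidle0" });
        if (options.selector) {
            // Optionally wait for an element that only appears after rendering.
            await page.waitForSelector(options.selector);
        }
        return await page.content();
    } finally {
        // Always shut the browser down, even if navigation failed.
        await browser.close();
    }
}

// Usage, mirroring the request-promise snippet from the question
// (".some-rendered-element" is a placeholder selector):
// fetchRenderedHtml("https://someurl.com", { selector: ".some-rendered-element" })
//     .then((html) => { /* Do something with the rendered HTML */ })
//     .catch((err) => console.log(err));
```

Waiting for network idle is only a heuristic. If the site polls in the background or loads data late, waiting for a specific selector that appears only after rendering is usually more reliable.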

Here is a good example that should help you get started: https://www.r-bloggers.com/web-scraping-javascript-rendered-sites/

Also, you could check this article about SEO for AngularJS-powered sites; the section "Spitting out the HTML Pages" has useful information: https://www.yearofmoo.com/2012/11/angularjs-and-seo.html#sptting-out-the-html-pages
