
How are JavaScript-based websites different?

I am trying to scrape the content of a website that seems to rely on JavaScript or some similar technology. I am using XPath to find the content on the page. I can see the content using Firebug in the browser, but if I save the page source or download it via curl/wget, the content is missing from the page. How is this possible?

Thanks in advance.

Some content is loaded dynamically via JavaScript. You need to run the JavaScript somehow, for example in a headless browser such as PhantomJS, for several seconds so that the dynamic content has time to load. Then read the rendered content back out of the DOM, similar to what .html() in jQuery returns.

As far as I know, this is similar to what Opera Mini does on its proxies before they re-encode the page and send it to your device:

The server sends the response back as normal; when this is received by the Opera transcoding servers, they parse the markup and styles, execute the JavaScript, and transcode the data into Opera Binary Markup Language (OBML). This OBML data is progressively loaded by Opera Mini on the user's device.

From Opera Mini's entry on Wikipedia:

JavaScript will only run for a couple of seconds on the Mini server before pausing, due to resource constraints.

According to the documentation for Opera Mini 4, before the page is sent to the mobile device, its onLoad events are fired and all scripts are allowed a maximum of two seconds to execute. The setInterval and setTimeout functions are disabled, so scripts designed to wait a certain amount of time before executing will not execute at all. After the scripts have finished or the timeout is reached, all scripts are stopped and the page is compressed and sent to the mobile device.
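To illustrate why the source saved by curl/wget lacks the content, here is a minimal Python sketch using only the standard library. The page markup, the `/api/items` URL, and the `content` id are all made up for illustration; the point is only that the raw HTML holds an empty container plus a script, and a parser that does not execute JavaScript never sees the filled-in text.

```python
from html.parser import HTMLParser

# Hypothetical page source, as curl/wget would save it: the container is
# empty, and a script is expected to fill it in once it runs in a browser.
RAW_HTML = """
<html><body>
  <div id="content"></div>
  <script>
    fetch("/api/items").then(r => r.json()).then(items => {
      document.getElementById("content").textContent = items.join(", ");
    });
  </script>
</body></html>
"""


class ContentExtractor(HTMLParser):
    """Collects the text inside the element whose id is "content".

    Simplified for the sketch: void tags such as <br> inside the target
    element are not handled.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == "content":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_content(html):
    """Return the stripped text of the id="content" element."""
    parser = ContentExtractor()
    parser.feed(html)
    return "".join(parser.chunks).strip()


# The script was never executed, so the container is still empty.
print(repr(extract_content(RAW_HTML)))  # prints ''
```

Firebug shows the DOM after the script has run, which is why the same lookup succeeds in the browser and fails against the downloaded file.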

Typically the page loads and then requests the content via Ajax, and the response comes back as JSON or JSONP. This is usually pretty handy for scraping because JSON is even easier to parse than HTML.

But if you haven't done it before, it can be a challenge to figure out how to make the right Ajax request.
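As a rough sketch of that approach, assuming you have already found the request URL (for example in the Net panel of Firebug), the Python below fetches it and decodes either plain JSON or a JSONP wrapper. The function names and the request headers are illustrative assumptions; a real site may require different headers, cookies, or query parameters.

```python
import json
import re
from urllib.request import Request, urlopen

# Matches a JSONP wrapper such as  callback({...});  and captures the payload.
_JSONP = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_json_or_jsonp(text):
    """Decode a response body that is either plain JSON or JSONP."""
    match = _JSONP.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


def fetch_data(url):
    """Request the data URL the page itself would call and decode the reply.

    The headers are a guess at what an Ajax call typically sends; the
    target site may not need them or may expect others.
    """
    request = Request(url, headers={
        "X-Requested-With": "XMLHttpRequest",
        "Accept": "application/json",
    })
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return parse_json_or_jsonp(response.read().decode(charset))
```

Calling `fetch_data` with the URL observed in the browser returns ordinary Python dicts and lists, so no XPath is needed for that part of the page.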

