
Getting the Final HTML with JavaScript Rendered as a String in Java

I want to fetch data from an HTML page (scrape it), but the page loads its reviews with JavaScript. With a normal URL fetch in Java I only get the original HTML, without the JavaScript executed. I want the final page, after the JavaScript has run.

Example: http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp

This page shows its comments through a Facebook plugin, and they are fetched by JavaScript.

The same is true of this page: http://www.imdb.com/title/tt0848228/reviews

What should I do?

Use PhantomJS: http://phantomjs.org

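var page = require('webpage').create();
page.open("http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp", function (status) {
    if (status !== "success") {
        console.log("Failed to load the page");
        phantom.exit(1);
        return;
    }
    // Wait for the page's scripts (such as the Facebook plugin) to finish
    setTimeout(function () {
        // Where you want to save the screenshot
        page.render("screenshot.png");
        // page.evaluate returns only serializable values, so return HTML strings (assumes the page loads jQuery)
        var fbcomments = page.evaluate(function () {
            return $(".fb-comments iframe").contents().find(".postContainer")
                .map(function () { return this.outerHTML; }).get();
        });
        console.log(fbcomments.join("\n"));
        phantom.exit();
    }, 10000);
});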

You have to run PhantomJS with the option --web-security=no to allow cross-domain interaction (i.e. for the Facebook iframe).
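Since the question asks for the result as a Java String, one option is to start PhantomJS from Java and read what the script prints. The following is only a minimal sketch under several assumptions: the class name `PhantomRunner` and its methods are hypothetical, a `phantomjs` executable is on the PATH, and the script reads the URL from `require('system').args[1]` and writes the rendered markup with `console.log(page.content)`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper: runs a PhantomJS script and returns what it prints.
class PhantomRunner {

    // Builds the command line; --web-security=no is the option mentioned above.
    static List<String> buildCommand(String phantomBinary, String scriptPath, String url) {
        return List.of(phantomBinary, "--web-security=no", scriptPath, url);
    }

    // Starts the process and collects its standard output as one String.
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                // Send stderr to this JVM's stderr so it neither blocks nor mixes into the HTML.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        try (InputStream out = process.getInputStream()) {
            String output = new String(out.readAllBytes(), StandardCharsets.UTF_8);
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("Process exited with code " + exitCode);
            }
            return output;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: PhantomRunner <script.js> <url>");
            return;
        }
        // Assumption: "phantomjs" can be found on the PATH.
        String html = run(buildCommand("phantomjs", args[0], args[1]));
        System.out.println(html);
    }
}
```

Reading standard output keeps the Java side free of extra dependencies, but it means the script should print nothing except the content you want to capture.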

To communicate with other applications from PhantomJS, you can use a web server or make a POST request: https://github.com/ariya/phantomjs/blob/master/examples/post.js
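On the Java side, the receiving end of such a POST request could look roughly like the sketch below. It uses the JDK's built-in `com.sun.net.httpserver.HttpServer`. The class name `HtmlReceiver`, its methods, and the `/html` path are all hypothetical choices made for this illustration. It also assumes the PhantomJS script sends the rendered markup as the raw request body, for example with `page.open(url, 'post', data, callback)` as in the linked example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical receiver: accepts HTML that a PhantomJS script sends via POST.
class HtmlReceiver {
    private final BlockingQueue<String> received = new LinkedBlockingQueue<>();
    private final HttpServer server;

    private HtmlReceiver(HttpServer server) {
        this.server = server;
    }

    // Starts listening on localhost; pass 0 to let the system pick a free port.
    static HtmlReceiver start(int port) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("localhost", port), 0);
            HtmlReceiver receiver = new HtmlReceiver(server);
            // "/html" is an arbitrary path chosen for this sketch.
            server.createContext("/html", exchange -> {
                byte[] body = exchange.getRequestBody().readAllBytes();
                receiver.received.add(new String(body, StandardCharsets.UTF_8));
                exchange.sendResponseHeaders(204, -1); // 204 No Content, no response body
                exchange.close();
            });
            server.start();
            return receiver;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The port the server is actually bound to.
    int port() {
        return server.getAddress().getPort();
    }

    // Waits up to timeoutMillis for a posted page; returns null if none arrives.
    String awaitHtml(long timeoutMillis) {
        try {
            return received.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    void stop() {
        server.stop(0);
    }
}
```

A caller would do something like `HtmlReceiver r = HtmlReceiver.start(8080);`, launch the PhantomJS script so that it posts to `http://localhost:8080/html`, and then read the page with `r.awaitHtml(30000)` before calling `r.stop()`.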

You can use HtmlUnit, a Java-based "GUI-less browser". You can easily get the final rendered output of any page, because it loads the page as a web browser does and returns the final rendered output. You can disable this behaviour, though.

UPDATE: You asked for an example. You don't have to do anything extra for that:

Example:

// myUrl is the address of the page to load
WebClient webClient = new WebClient();
HtmlPage myPage = (HtmlPage) webClient.getPage(myUrl);

UPDATE 2: You can get an iframe as follows:

HtmlPage myFrame = (HtmlPage) myPage.getFrameByName(myIframeName).getEnclosedPage();

Please read the documentation at the link above. There is nothing you can't do when it comes to getting page content with HtmlUnit.

A simple way to solve this problem: you can use HtmlUnit, a Java API. I think it can help you access the content produced by the executed JavaScript as if it were simple HTML.

WebClient webClient = new WebClient();
HtmlPage myPage = (HtmlPage) webClient.getPage(new URL("YourURL"));
System.out.println(myPage.getVisibleText());
