Render JavaScript and HTML in (any) Java Program (Access rendered DOM Tree)?

What are the best Java libraries to fully download any webpage, render the built-in JavaScript(s), and then access the rendered webpage (that is, the DOM tree!) programmatically and get the DOM tree as HTML source?

(Something similar to what Firebug does, in the end: it renders the page and I get access to the fully rendered DOM tree, just as the page looks in the browser! In contrast, if I click "show source" I only get the JavaScript source code. That is not what I want; I need access to the rendered page...)

(By rendering I mean only building the DOM tree, not a visual rendering...)

This does not have to be one single library; it's OK to have several libraries that accomplish this together (one downloads, one renders...), but due to the dynamic nature of JavaScript, the JavaScript library will most likely also need some kind of downloader of its own to fully render any asynchronous JS...

Background:
In the "good old days", HttpClient (the Apache library) was everything required to build your own very simple crawler. (A lot of crawlers like Nutch or Heritrix are still built around this core principle, mainly focusing on standard HTML parsing, so I can't learn from them.) My problem is that I need to crawl some websites that rely heavily on JavaScript and that I can't parse with HttpClient, as I definitely need to execute the JavaScript first...
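To make the problem concrete, here is a small self-contained sketch (the page markup is hypothetical) of why a plain HTTP fetch is not enough: the raw source contains the script text, but never the content that the script would have written into the DOM at render time.

```java
public class RawSourceDemo {

    // What an HTTP client (e.g. Apache HttpClient) receives for a hypothetical
    // JavaScript-driven page: static markup plus an UNEXECUTED script.
    static final String RAW_SOURCE =
        "<html><body>"
        + "<div id=\"content\"></div>"
        + "<script>document.getElementById('content').innerHTML = 'Hello, rendered world';</script>"
        + "</body></html>";

    // True only if the text is present in the static markup itself,
    // i.e. outside of any <script> block.
    static boolean visibleWithoutJavaScript(String html, String text) {
        String withoutScripts = html.replaceAll("(?s)<script>.*?</script>", "");
        return withoutScripts.contains(text);
    }

    public static void main(String[] args) {
        // The generated text exists only inside the script source; a client
        // that never executes JavaScript will never see it in the DOM.
        System.out.println(visibleWithoutJavaScript(RAW_SOURCE, "Hello, rendered world")); // false
    }
}
```

Only a JavaScript-executing renderer would turn that empty `<div>` into the text the crawler actually wants.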

This is a bit outside of the box, but if you are planning on running your code on a server where you have complete control over your environment, it might work...

Install Firefox (or XulRunner, if you want to keep things lightweight) on your machine.

Using the Firefox plugin system, write a small plugin which loads a given URL, waits a few seconds, then copies the page's DOM into a String.

From this plugin, use the Java LiveConnect API (see http://jdk6.java.net/plugin2/liveconnect/ and https://developer.mozilla.org/en/LiveConnect) to push that string across to a public static function in some embedded Java code, which can either do the required processing itself or farm it out to some more complicated code.
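On the Java side, the receiving end of such a bridge can be as simple as a public static method that the browser side calls by name. The class and method names below are purely illustrative, not a fixed LiveConnect API:

```java
public class DomReceiver {

    // A public static entry point; from the browser side, LiveConnect can
    // reach it as e.g. Packages.DomReceiver.receiveDom(serializedDom).
    // (The names here are hypothetical - choose your own.)
    public static String receiveDom(String serializedDom) {
        // Do the required processing itself, or farm it out to more
        // complicated code; here we just report what arrived.
        return "received " + serializedDom.length() + " characters of DOM";
    }

    public static void main(String[] args) {
        System.out.println(receiveDom("<html><body>rendered</body></html>"));
    }
}
```

The method must be `public static` so it is reachable without the plugin having to construct Java objects first.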

Benefits: You are using a browser that most application developers target, so the observed behavior should be comparable. You can also upgrade the browser along the normal upgrade path, so your library won't become out-of-date as HTML standards change.

Disadvantages: You will need to have permission to start a non-headless application on your server. You'll also have the complexity of inter-process communication to worry about.

I have used the plugin API to call Java before, and it's quite achievable. If you'd like some sample code, you should take a look at the XQuery plugin - it loads XQuery code from the DOM, passes it across to the Java Saxon library for processing, then pushes the result back into the browser. There are some details about it here:

https://developer.mozilla.org/en/XQuery

You can use the JavaFX 2 WebEngine. Download the JavaFX SDK (you may already have it if you installed JDK 7u2 or later) and try the code below.

It will print the HTML with the JavaScript already processed. You can also uncomment the lines in the middle to see the visual rendering.

import java.io.StringWriter;

import javafx.application.Application;
import javafx.beans.value.ChangeListener;
import javafx.beans.value.ObservableValue;
import javafx.concurrent.Worker;
import javafx.scene.Scene;
import javafx.scene.web.WebEngine;
import javafx.scene.web.WebView;
import javafx.stage.Stage;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

public class WebLauncher extends Application {

    @Override
    public void start(Stage stage) {
        final WebView webView = new WebView();
        final WebEngine webEngine = webView.getEngine();
        webEngine.load("http://stackoverflow.com");
        //stage.setScene(new Scene(webView));
        //stage.show();

        // Wait until the load worker reports SUCCEEDED (the page and its
        // JavaScript have finished loading), then serialize the DOM. This is
        // more reliable than checking workDone == 100 percent.
        webEngine.getLoadWorker().stateProperty().addListener(new ChangeListener<Worker.State>() {
            @Override
            public void changed(ObservableValue<? extends Worker.State> observable,
                                Worker.State oldValue, Worker.State newValue) {
                if (newValue == Worker.State.SUCCEEDED) {
                    try {
                        org.w3c.dom.Document doc = webEngine.getDocument();
                        // Serialize via the standard javax.xml.transform API
                        // (the internal Xerces XMLSerializer is not accessible
                        // on modern JDKs).
                        Transformer transformer = TransformerFactory.newInstance().newTransformer();
                        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
                        StringWriter out = new StringWriter();
                        transformer.transform(new DOMSource(doc), new StreamResult(out));
                        System.out.println(out);
                    } catch (Exception ex) {
                        ex.printStackTrace();
                    }
                }
            }
        });
    }

    public static void main(String[] args) {
        launch();
    }

}

The Selenium library is normally used for testing, but it does give you remote control of most standard browsers (IE, Firefox, etc.) as well as a headless, browser-free mode (using HtmlUnit). Because it is intended for UI verification by page scraping, it may well serve your purposes.

In my experience it can sometimes struggle with very slow JavaScript, but with careful use of "wait" commands you can get quite reliable results.
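The "wait" pattern boils down to polling a condition until it holds or a timeout expires, which is what Selenium's explicit waits (`WebDriverWait`) do internally. A minimal stdlib sketch of that pattern, with illustrative names rather than Selenium's actual API:

```java
import java.util.function.BooleanSupplier;

// A minimal explicit-wait helper: poll a condition until it becomes true
// or a timeout expires. With Selenium, the condition would typically check
// that a JavaScript-generated element has appeared in the page.
public class Wait {

    /** Polls condition every pollMillis ms; returns true if it held within
     *  timeoutMillis ms, false on timeout. */
    public static boolean until(BooleanSupplier condition,
                                long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(pollMillis);
        }
        return condition.getAsBoolean(); // one last check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate slow JavaScript: the "element" appears after ~200 ms.
        long appearsAt = System.currentTimeMillis() + 200;
        boolean found = until(() -> System.currentTimeMillis() >= appearsAt, 2000, 50);
        System.out.println(found); // the condition held before the timeout
    }
}
```

Polling a condition like this is far more robust than a fixed `Thread.sleep`, because it adapts to however long the page's JavaScript actually takes.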

It also has the benefit that you can actually drive the page, not just scrape it. That means that if you need to perform some actions on the page before you get to the data you want (click the search button, click next, now scrape), you can code that into the process.

I don't know if you'll be able to get the full DOM in a navigable form from Selenium, but it does provide XPath retrieval for the various parts of the page, which is what you'd normally need for a scraping application.

You can use Java or Groovy, with or without Grails. Then use WebDriver, Selenium, Spock and Geb - these are meant for testing purposes, but the libraries are useful for your case. You can implement a crawler that won't open a new window, but only a runtime of either of these browsers.

You can try JExplorer. For more information see http://www.teamdev.com/downloads/jexplorer/docs/JExplorer-PGuide.html

You can also try Cobra, see http://lobobrowser.org/cobra.jsp

I haven't tried this project, but I have seen several implementations for node.js that include JavaScript DOM manipulation.

https://github.com/tmpvar/jsdom
