Retrieve rendered HTML DOM in pure Java

Originally posted 2012-01-31 16:05:26. Tags: java, javascript, ajax, browser, rendering

I know there are already some similar questions here. But I do not want to build a browser in Java; I only want to see the fully generated (or "rendered") source code, as if I were looking at the generated DOM in the browser. Does anybody know a tool for that?

I had a look at Cobra and HtmlUnit, but they don't seem to be able to render more complex websites correctly, especially if there are AJAX calls adding content to the site after it has loaded. I really need a tool that does the same as a browser does, without actually displaying the page. Do I have to remote-control a browser in the end?

Does anybody have experience with that?

A very similar question, but without any satisfying answers, can be found here.

3 Answers

I don't believe a library exists that scrapes the content loaded by asynchronous calls after the page has loaded.

My recommendation is:

Get the HTML of a page using Cobra or a similar library.
Parse the source for AJAX requests (for example, an AJAX call will have a URL parameter and a "data" JSON string that you can use for the request).
For each AJAX call, make another request to the URL parameter you captured.
Append the result of each AJAX call to the HTML source of the original page.
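
The steps above can be sketched roughly as follows. This is only an illustration under loud assumptions: the page is assumed to make jQuery-style calls (`$.ajax({ url: ... })`, `$.get(...)`, `$.post(...)`, `$.getJSON(...)`) with literal URL strings, the page HTML is assumed to be already available as a string, and the names `AjaxInliner`, `findAjaxUrls`, `inlineAjax`, and `httpFetcher` are hypothetical. A site using another JavaScript library would need a different pattern.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AjaxInliner {
    // Assumption: jQuery-style calls with a literal URL string.
    // Group 1: $.get("/x"), $.post("/x"), $.getJSON("/x")
    // Group 2: $.ajax({ ..., url: "/x", ... })
    private static final Pattern AJAX_URL = Pattern.compile(
        "\\$\\.(?:get|post|getJSON)\\(\\s*[\"']([^\"']+)[\"']"
        + "|\\$\\.ajax\\(\\s*\\{[^}]*?\\burl\\s*:\\s*[\"']([^\"']+)[\"']");

    // Step 2: scan the page source for AJAX URLs, in order of appearance.
    static List<String> findAjaxUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = AJAX_URL.matcher(html);
        while (m.find()) {
            urls.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return urls;
    }

    // Steps 3 and 4: request each captured URL and append the response
    // to the original source. The fetcher is injected so it can be swapped.
    static String inlineAjax(String html, Function<String, String> fetch) {
        StringBuilder out = new StringBuilder(html);
        for (String url : findAjaxUrls(html)) {
            out.append("\n<!-- ajax: ").append(url).append(" -->\n")
               .append(fetch.apply(url));
        }
        return out.toString();
    }

    // A plain GET fetcher that resolves relative URLs against the page URL.
    static Function<String, String> httpFetcher(URI pageUri) {
        HttpClient client = HttpClient.newHttpClient();
        return relative -> {
            try {
                HttpRequest req = HttpRequest.newBuilder(pageUri.resolve(relative)).GET().build();
                return client.send(req, HttpResponse.BodyHandlers.ofString()).body();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while fetching " + relative, e);
            }
        };
    }

    public static void main(String[] args) {
        String html = "<html><script>$.ajax({ type: 'GET', url: '/api/items' });"
            + " $.get(\"/api/user\");</script></html>";
        // A stand-in fetcher so the example runs without network access.
        System.out.println(inlineAjax(html, url -> "[response of " + url + "]"));
    }
}
```

For a live page, `httpFetcher(URI.create(pageUrl))` could be passed in place of the stand-in fetcher. Calls that send a "data" payload or use POST would need more than the plain GET shown here.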

It's not a perfect solution, and it will not help in scenarios that require the user to trigger an event. Also, your code for capturing the URLs of the AJAX calls will differ depending on which JavaScript library the website uses to make its async calls.

Hope that helps.

I have to answer this myself... In the end, the best solution I found was actually HtmlUnit, but it is just too slow for my needs. So I built my own tool, which of course needs manual setup to call the required links. In return, it does not have to wait for any JS timeouts or the like; it parses the required information from the page and makes the desired calls itself. It's a lot of manual work, but it looks like there is no other solution that works fast enough.

Selenium does something similar to this. You need to install Selenium Remote Control on your machine. Then you can pass the URL request to Selenium. Selenium will open a browser and render the HTML/DHTML page mentioned in the URL. After that, you can get the entire DOM by querying Selenium. You can do all of this in code.

http://seleniumhq.org/ Please note: you need to install either Selenium WebDriver or Selenium Remote Control, not Selenium IDE.
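
As a rough illustration of the WebDriver route, here is a minimal sketch. It assumes the Selenium Java bindings are on the classpath and that Chrome with a matching driver is installed. The class name `RenderedDom`, the method `fetchRenderedHtml`, and the example URL are hypothetical, and the `--headless` flag is a Chrome option, not something the answer above prescribes.

```java
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

final class RenderedDom {
    // Loads the page in a real browser and returns the DOM as currently rendered.
    static String fetchRenderedHtml(String url) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless"); // assumption: run Chrome without a visible window
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get(url); // returns once the page has loaded
            // Serialize the live DOM, including nodes that scripts have added so far.
            return (String) ((JavascriptExecutor) driver)
                .executeScript("return document.documentElement.outerHTML;");
        } finally {
            driver.quit(); // always shut the browser down
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchRenderedHtml("https://example.com/"));
    }
}
```

Content added by AJAX calls that finish after the initial load may not be in the DOM yet when the script runs. An explicit wait (Selenium's `WebDriverWait` with an `ExpectedConditions` check for an element the page is known to add) would likely be needed before serializing.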