Load HTML string into DOM tree with JavaScript

I'm currently working with an automation framework that pulls down a webpage for analysis and presents it as a string for processing. The Rhino JavaScript engine is available to assist in parsing the returned web page.

It seems that if the string (which is a complete webpage) could be loaded into a DOM representation, it would provide a very nice interface for parsing and analyzing the content.

Using only JavaScript, is this possible and/or feasible?

Edit:

I'll decompose the question for clarity: say I have a string in JavaScript that contains HTML, like so:


var $mywebpage = '<!DOCTYPE HTML PUB ...//snipped//... </body></html>';

Is it possible/realistic to load it somehow into a DOM object?

I'm accepting JonDavidJohn's answer as it was useful in solving my problem, though I'm including this additional answer for others who may view this in the future.

It appears that while JavaScript in a browser allows loading HTML strings into a DOM element, the DOM is not part of core ECMAScript and as such is not available to scripts running under Rhino.

As a side note worth mentioning, a good alternative that was implemented in Rhino 1.6 is E4X. While not a DOM implementation, it does provide conceptually similar capabilities.

If the document is XHTML, you can parse it with any XML parser. E4X would probably do the job nicely, as would the built-in Java XML parsing interfaces.
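As a rough illustration of the Java route, the sketch below calls the JDK's built-in XML parser from a Rhino script. It assumes the script runs under Rhino (where the global `Packages` object exposes Java classes) and that the input is well-formed XHTML; `parseXhtml` is a name made up for this example.

```javascript
// Parse a well-formed XHTML string into an org.w3c.dom.Document.
// Assumption: running under Rhino, whose global `Packages` object bridges
// to Java classes. parseXhtml is a hypothetical helper name.
function parseXhtml(xhtml) {
  if (typeof Packages === 'undefined') {
    throw new Error('parseXhtml requires Rhino (no Java bridge found)');
  }
  var factory = Packages.javax.xml.parsers.DocumentBuilderFactory.newInstance();
  var builder = factory.newDocumentBuilder();
  // Wrap the string in a Reader so the parser does not treat it as a URI.
  var reader = new Packages.java.io.StringReader(xhtml);
  var source = new Packages.org.xml.sax.InputSource(reader);
  return builder.parse(source);
}

// Usage, guarded so the snippet does nothing outside Rhino.
if (typeof Packages !== 'undefined') {
  var doc = parseXhtml('<html><body><p>Hello</p></body></html>');
  var paragraphs = doc.getElementsByTagName('p');
  // Print the text of the first paragraph through Java's standard output.
  Packages.java.lang.System.out.println(paragraphs.item(0).getTextContent());
}
```

If the page carries a DOCTYPE that points at an external DTD, the default parser may try to fetch it, so it may be worth stripping the DOCTYPE or configuring an entity resolver first.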

The env.js library is designed to emulate the browser environment under Rhino, but I believe your document also needs to be compliant XHTML:

http://ejohn.org/blog/bringing-the-browser-to-the-server/

http://www.envjs.com/

If it's HTML, however, it's more difficult, as browsers are designed to be extremely lenient in how markup is parsed. See here for a list of HTML parsers in Java:

http://java-source.net/open-source/html-parsers

This is not an easy problem to solve. People have gone so far as to embed the Mozilla Gecko engine in Java via JNI in order to use its parsing capabilities.

I would recommend you look into the following pure-Java project:

http://lobobrowser.org/cobra.jsp

The goal of the Lobo project is to develop a pure-Java web browser. It's a pretty interesting project with a lot in it, but I believe you could use its parser standalone quite easily in your own application, as described in the following link:

http://lobobrowser.org/cobra/java-html-parser.jsp

If you have this variable that contains HTML, you can load it into a DOM element that you look up, for example, by id.

var mywebpage = '<!DOCTYPE HTML PUB ...//snipped//... </body></html>';
var element = document.getElementById('dom-id');  // element you are loading it into
element.innerHTML = mywebpage;
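For a complete page rather than a fragment, a browser's `DOMParser` may be a closer fit, since assigning a full document to the `innerHTML` of an existing element typically drops the doctype and the `html`/`head`/`body` wrappers. The sketch below is a minimal illustration, assuming a browser-like environment; `parseHtmlDocument` and its `ParserCtor` parameter are names invented here, and none of this is available under plain Rhino.

```javascript
// Parse a complete HTML document string into a detached Document.
// Assumption: a browser-like environment that provides DOMParser.
// ParserCtor is a hypothetical injection point (not part of any standard API)
// so the function can be exercised where no global DOMParser exists.
function parseHtmlDocument(html, ParserCtor) {
  var Ctor = ParserCtor || (typeof DOMParser !== 'undefined' ? DOMParser : null);
  if (!Ctor) {
    throw new Error('DOMParser is not available in this environment');
  }
  // 'text/html' selects the lenient HTML parser rather than an XML parser.
  return new Ctor().parseFromString(html, 'text/html');
}

// Usage, guarded so the snippet is harmless outside a browser.
if (typeof DOMParser !== 'undefined') {
  var page = '<!DOCTYPE html><html><head><title>Demo</title></head>' +
             '<body><p id="greeting">Hello</p></body></html>';
  var doc = parseHtmlDocument(page);
  console.log(doc.title);                                   // title text
  console.log(doc.getElementById('greeting').textContent);  // paragraph text
}
```

The returned document is not attached to the visible page, so it can be queried with the usual DOM methods without affecting what the user sees.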
