Getting unparsed (raw) HTML with JavaScript

I need to get the actual HTML code of an element in a web page.

For example, if the actual HTML code inside the element is "How to&nbsp;fix"

Running this JavaScript:

document.getElementById('myE').innerHTML

gives me "How to fix", which is the parsed HTML.

How can I get the unparsed "How to&nbsp;fix" using JavaScript?

You cannot get the actual HTML source of part of your web page.

When you give a web browser an HTML page, it parses the HTML into DOM nodes, and those nodes are the definitive version of your document as far as the browser is concerned. The DOM keeps the significant information from the HTML, such as the fact that you used the Unicode character U+00A0 Non-Breaking Space before the word fix, but not the irrelevant detail that you wrote it as an entity reference (&nbsp;) rather than typing the character raw.

When you ask the browser for an element node's innerHTML, it doesn't give you the original HTML source that was parsed to produce that node, because it no longer has that information. Instead, it generates new HTML from the data stored in the DOM. The browser decides how to format that HTML serialisation; different browsers produce different HTML, and chances are it won't be formatted the way you wrote it originally.

In particular,

  • element names may be upper- or lower-cased;

  • attributes may not be in the same order as you stated them in the HTML;

  • attribute quoting may not be the same as in your source. IE often generates unquoted attributes that aren't even valid HTML; all you can be sure of is that the generated innerHTML will be safe to use in the same browser by writing it to another element's innerHTML;

  • it may not use entity references for anything but characters that would otherwise be impossible to include directly in text content: ampersands, less-than signs and attribute-value quotes. Instead of returning &nbsp; it may simply give you the raw U+00A0 character.

You may not be able to see that that's a non-breaking space, but it still is one, and if you insert that HTML into another element it will act as one. You shouldn't need to rely anywhere on a non-breaking space character being entity-escaped to &nbsp;. If you do, for some reason, you can get that by doing:

x = el.innerHTML.replace(/\xA0/g, '&nbsp;');

but that's only escaping U+00A0 and not any of the other thousands of possible Unicode characters, so it's a bit questionable.
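To illustrate the limitation above, here is a minimal sketch that escapes every non-ASCII character rather than only U+00A0. The function name escapeNonAscii is made up for this example, and it emits numeric character references (such as &#160;) instead of named entities like &nbsp;, since mapping characters back to entity names would need a lookup table. It still cannot tell you which characters were written as entities in the original source.

```javascript
// Replace every character outside the ASCII range with a numeric
// character reference, e.g. U+00A0 becomes "&#160;".
// escapeNonAscii is a hypothetical helper name, not a built-in.
function escapeNonAscii(html) {
  // The "u" flag makes the regex work on whole code points, so characters
  // outside the Basic Multilingual Plane are matched as a single unit.
  return html.replace(/[^\x00-\x7F]/gu, function (ch) {
    return '&#' + ch.codePointAt(0) + ';';
  });
}

// Usage in a browser (el is assumed to be an element node):
//   var x = escapeNonAscii(el.innerHTML);
```

For the string "How to\u00A0fix" this returns "How to&#160;fix".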

If you really, really need to get your page's actual source HTML, you can make an XMLHttpRequest to your own URL (location.href) and get the full, unparsed HTML source in the responseText. There is almost never a good reason to do this.
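A rough sketch of that approach might look like the following. fetchPageSource wraps the XMLHttpRequest call and only works in a browser. rawInnerSource is a hypothetical helper added purely for illustration: it uses a naive regular expression, assumes the id contains only word characters, and breaks on nested elements with the same tag name, so treat it as a demonstration rather than a robust parser.

```javascript
// Request the unparsed source of a page and pass it to a callback.
// Browser-only: XMLHttpRequest is not defined in other environments.
function fetchPageSource(url, onSource) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', url);
  xhr.onload = function () {
    onSource(xhr.responseText); // the raw HTML text, entities intact
  };
  xhr.send();
}

// Hypothetical helper: return the raw text between the start tag that
// carries the given id and the first closing tag of the same name.
// Assumes a simple, non-nested element and an id made of word characters.
function rawInnerSource(source, tagName, id) {
  var pattern = new RegExp(
    '<' + tagName + '\\b[^>]*\\bid=(["\']?)' + id + '\\1[^>]*>' +
      '([\\s\\S]*?)</' + tagName + '\\s*>',
    'i'
  );
  var match = pattern.exec(source);
  return match ? match[2] : null;
}

// Usage in a browser:
//   fetchPageSource(location.href, function (source) {
//     console.log(rawInnerSource(source, 'div', 'myE'));
//   });
```

Note that the response is whatever the server returns for that URL now, which may differ from the page that was originally loaded if the content is dynamic.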

What you have should work:

Element test:

<div id="myE">How to&nbsp;fix</div>

JavaScript test:

alert(document.getElementById("myE").innerHTML); // alerts "How to&nbsp;fix"

Make sure that wherever you're using the result isn't showing &nbsp; as a space, which is likely the case. If you want to show it somewhere that's designed for HTML, you'll need to escape it.
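As a sketch of what that escaping could look like, the function below replaces the characters that HTML treats specially, so that the string "How to&nbsp;fix" is displayed literally instead of being rendered with a space. The name escapeHtml is an assumption for this example, not a built-in.

```javascript
// Escape the characters that are significant in HTML so that markup
// is displayed as text instead of being interpreted.
// escapeHtml is a hypothetical helper name.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;') // must run first so later entities are not double-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Usage in a browser (outputEl is assumed to be an element node):
//   outputEl.innerHTML = escapeHtml(document.getElementById("myE").innerHTML);
```

In a browser, assigning the string to an element's textContent achieves the same display without manual escaping.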

You can use a script tag instead, whose contents the browser does not parse as HTML. This matters most when the content contains angle brackets, as when loading a lodash or Underscore template.

HTML:

<div id="myDiv">
  <h1>
    <%= ${var} %>
    How to&nbsp;fix
  </h1>
</div>
<script id="myScript" type="text/template">
  <h1>
    <%= ${var} %>
    How to&nbsp;fix
  </h1>
</script>
<textarea rows="10" cols="40" id="asDiv"></textarea>
<textarea rows="10" cols="40" id="asScript"></textarea>

JavaScript:

// Copy each container's innerHTML into a textarea for comparison.
document.getElementById("asDiv").value = document.getElementById("myDiv").innerHTML;
document.getElementById("asScript").value = document.getElementById("myScript").innerHTML;

Because the HTML in a div is parsed, angle brackets in its innerHTML come back escaped (for example, < as &lt;), but those in the script element do not.
