
Getting unparsed (raw) HTML with JavaScript

I need to get the actual html code of an element in a web page.

For example, if the actual HTML code inside the element is "How to&nbsp;fix"

Running this JavaScript:

document.getElementById('myE').innerHTML

Gives me "How to fix" (with a raw non-breaking space character before "fix"), which is the parsed HTML.

How can I get the unparsed "How to&nbsp;fix" using JavaScript?

You cannot get the actual HTML source of part of your web page.

When you give a web browser an HTML page, it parses the HTML into some DOM nodes that are the definitive version of your document as far as the browser is concerned. The DOM keeps the significant information from the HTML (like that you used the Unicode character U+00A0 Non-Breaking Space before the word fix) but not the irrelevant information that you used it by means of an entity reference (&nbsp;) rather than just typing it raw.

When you ask the browser for an element node's innerHTML , it doesn't give you the original HTML source that was parsed to produce that node, because it no longer has that information. Instead, it generates new HTML from the data stored in the DOM. The browser decides on how to format that HTML serialisation; different browsers produce different HTML, and chances are it won't be the same way you formatted it originally.

In particular,

  • element names may be upper- or lower-cased;

  • attributes may not be in the same order as you stated them in the HTML;

  • attribute quoting may not be the same as in your source. IE often generates unquoted attributes that aren't even valid HTML; all you can be sure of is that the innerHTML generated will be safe to use in the same browser by writing it to another element's innerHTML ;

  • it may not use entity references for anything but characters that would otherwise be impossible to include directly in text content: ampersands, less-thans and attribute-value-quotes. Instead of returning &nbsp; it may simply give you the raw U+00A0 character.
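To make the information loss concrete, here is a toy model of the parse-then-serialise round trip. It is only a sketch: `parseText` and `serializeText` are hypothetical names, the entity table is a tiny subset, and the escaping in `serializeText` follows what I understand the current HTML standard to specify for text content (older browsers, as noted above, may emit the raw character instead).

```javascript
// Toy model only: a real HTML parser handles far more than these cases.
var NAMED = { nbsp: '\u00A0', amp: '&', lt: '<', gt: '>', quot: '"' };

// "Parse" text content: decode a few named and decimal character references.
function parseText(source) {
  return source.replace(/&(#\d+|[a-z]+);/gi, function (match, ref) {
    if (ref.charAt(0) === '#') {
      return String.fromCodePoint(Number(ref.slice(1)));
    }
    // Unknown names are left untouched in this sketch.
    return Object.prototype.hasOwnProperty.call(NAMED, ref) ? NAMED[ref] : match;
  });
}

// "Serialise" text content: the serialiser only sees characters,
// so it cannot know which spelling the author originally used.
function serializeText(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/\u00A0/g, '&nbsp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}
```

Both `parseText('How to&nbsp;fix')` and `parseText('How to&#160;fix')` produce the same string containing U+00A0, so whatever the serialiser emits afterwards is the same for both spellings.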

A raw U+00A0 may look just like an ordinary space, but it is still a non-breaking space, and if you insert that HTML into another element it will act as one. You shouldn't need to rely anywhere on a non-breaking space character being entity-escaped to &nbsp; ... if you do, for some reason, you can get that by doing:

x = el.innerHTML.replace(/\xA0/g, '&nbsp;');

but that's only escaping U+00A0 and not any of the other thousands of possible Unicode characters, so it's a bit questionable.
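If you did want to cover more than U+00A0, one option is to turn every non-ASCII character into a numeric character reference. This is a minimal sketch, not a recommendation; `escapeNonAscii` is a name made up for illustration, and it still cannot tell you which spelling the original source used.

```javascript
// Replace every character outside ASCII with a decimal character reference.
// The "u" flag makes the regex match whole code points, so characters
// outside the Basic Multilingual Plane become a single reference.
function escapeNonAscii(html) {
  return html.replace(/[^\x00-\x7F]/gu, function (ch) {
    return '&#' + ch.codePointAt(0) + ';';
  });
}
```

In a page you would call it as `escapeNonAscii(el.innerHTML)`, where `el` is the element from the line above.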

If you really really need to get your page's actual source HTML, you can make an XMLHttpRequest to your own URL (location.href) and get the full, unparsed HTML source in the responseText. There is almost never a good reason to do this.
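For completeness, that approach might look roughly like this. It is a sketch under assumptions: `fetchOwnSource` and `rawInnerSource` are hypothetical helpers, the code only makes sense in a browser page, and the regex lookup is deliberately naive (simple ids, quoted id attributes, no nested elements with the same tag name).

```javascript
// Assumption: called in a browser, where XMLHttpRequest and location exist.
function fetchOwnSource(onDone) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', location.href);
  xhr.onload = function () {
    onDone(xhr.responseText); // the unparsed source as the server sent it
  };
  xhr.send();
}

// Naive lookup: raw text between the opening tag carrying the given id
// and the next closing tag of the same name. Not a real HTML parser.
function rawInnerSource(source, id) {
  var open = new RegExp('<([a-zA-Z][\\w-]*)[^>]*\\sid=["\']' + id + '["\'][^>]*>');
  var m = open.exec(source);
  if (!m) return null;
  var start = m.index + m[0].length;
  var end = source.toLowerCase().indexOf('</' + m[1].toLowerCase(), start);
  return end === -1 ? null : source.slice(start, end);
}

// Usage in a page:
// fetchOwnSource(function (source) {
//   console.log(rawInnerSource(source, 'myE'));
// });
```

Note that this re-requests the page, so the response can differ from what was originally loaded if the server generates it dynamically.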

What you have should work:

Element test:

<div id="myE">How to&nbsp;fix</div>

JavaScript test:

alert(document.getElementById("myE").innerHTML); // alerts "How to&nbsp;fix"

Make sure that wherever you're using the result isn't showing &nbsp; as a space, which is likely the case. If you want to show it somewhere that's designed for HTML, you'll need to escape it.
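Escaping here means turning the markup characters into entities so the receiving element displays the text instead of interpreting it. A minimal sketch, with `escapeHtml` being a made-up helper name rather than a built-in:

```javascript
// Escape the characters that HTML treats as markup.
// The ampersand must be handled first so later replacements are not re-escaped.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}
```

With this, `escapeHtml('How to&nbsp;fix')` gives `How to&amp;nbsp;fix`, which an HTML element will display as the literal text `How to&nbsp;fix`.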

You can use a script tag instead, which will not parse the HTML. This is more relevant when there are angle brackets, like loading a lodash or underscore template.

 // JavaScript (run once the elements below exist)
 document.getElementById("asDiv").value = document.getElementById("myDiv").innerHTML;
 document.getElementById("asScript").value = document.getElementById("myScript").innerHTML;

 <!-- HTML -->
 <div id="myDiv">
   <h1> <%= ${var} %> How to&nbsp;fix </h1>
 </div>
 <script id="myScript" type="text/template">
   <h1> <%= ${var} %> How to&nbsp;fix </h1>
 </script>
 <textarea rows="10" cols="40" id="asDiv"></textarea>
 <textarea rows="10" cols="40" id="asScript"></textarea>

Because the HTML in a div is parsed, the inner HTML for an opening angle bracket comes back as &lt;, but in a script element it does not.

