
How to convert unicode characters to HTML numeric entities using plain Javascript

I'm trying to convert innerHTML with special characters into their original &#...; entity values but can't seem to get it working for unicode values. Where am I going wrong?

The code is meant to take the contents of "Orig", encode them, and place the result into "Copy":

Orig: 1:🙂__2:𝌆__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:⚠️__11:⚠__12:🙂

Copy: 1:🙂�__2:𝌆�__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:⚠️__11:⚠__12:🙂�

... but obviously the dreaded black diamonds are appearing!

    if (!String.prototype.codePointAt) {
        String.prototype.codePointAt = function(pos) {
            pos = isNaN(pos) ? 0 : pos;
            var str = String(this),
                code = str.charCodeAt(pos),
                next = str.charCodeAt(pos + 1);
            // If a surrogate pair
            if (0xD800 <= code && code <= 0xDBFF && 0xDC00 <= next && next <= 0xDFFF) {
                return ((code - 0xD800) * 0x400) + (next - 0xDC00) + 0x10000;
            }
            return code;
        };
    }

    /**
     * Encodes special html characters
     * @param string
     * @return {*}
     */
    function html_encode(s) {
        var ret_val = '';
        for (var i = 0; i < s.length; i++) {
            if (s.codePointAt(i) > 127) {
                ret_val += '&#' + s.codePointAt(i) + ';';
            } else {
                ret_val += s.charAt(i);
            }
        }
        return ret_val;
    }

    var v = html_encode(document.getElementById('orig').innerHTML);
    document.getElementById('copy').innerHTML = v;
    document.getElementById('values').value = v;
    //console.log(v);

    div {
        padding: 10px;
        border: solid 1px grey;
    }
    textarea {
        width: calc(100% - 30px);
        height: 50px;
        padding: 10px;
    }

    Orig:<div id='orig'>1:🙂__2:𝌆__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:&#9888;&#65039;__11:&#9888;__12:&#128578;</div>
    Copy:<div id='copy'></div>
    Values:<textarea id='values'></textarea>

(A jsfiddle is available at https://jsfiddle.net/Abeeee/k6e4svqa/24/ )

I've been through the various suggestions on "How to convert characters to HTML entities using plain JavaScript", including he.js, which looks the most favourable, but when I downloaded that script it wouldn't parse (Unexpected token around line 32: .. var encodeMap = <%= encodeMap %>;).

I'm not sure where to go with this.

JavaScript strings are UTF-16. A character outside the Basic Multilingual Plane (a code point above U+FFFF) is stored as a surrogate pair and takes up two 16-bit code units. The length property of a string counts 16-bit code units, not characters. Thus "🙂".length returns 2.
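To make the code-unit counting concrete, here is a small sketch that can be pasted into a console or run under Node. The variable names are only illustrative and do not come from the question.

```javascript
// Sketch: .length counts UTF-16 code units, not characters.
const bmpChar = "ß";      // U+00DF, inside the Basic Multilingual Plane
const astralChar = "🙂";  // U+1F642, outside it, so stored as a surrogate pair

const bmpLength = bmpChar.length;        // one code unit
const astralLength = astralChar.length;  // two code units

// charCodeAt exposes the individual 16-bit units of the pair.
const highUnit = astralChar.charCodeAt(0); // high surrogate, range 0xD800-0xDBFF
const lowUnit = astralChar.charCodeAt(1);  // low surrogate, range 0xDC00-0xDFFF

console.log(bmpLength, astralLength, highUnit.toString(16), lowUnit.toString(16));
// prints: 1 2 d83d de42
```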

codePointAt(i) does not address the i-th character but the i-th 16-bit code unit. Hence, a surrogate pair spans two consecutive codePointAt positions. From the specs, "🙂".codePointAt(0) lands on the high surrogate and returns the full code point value, i.e. 128578, but "🙂".codePointAt(1) lands on the low surrogate and returns only that unit, 56898. A lone surrogate is not a valid character, so &#56898; renders as that black diamond.
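The two values mentioned above can be checked directly. This is only a sketch; the names `smiley`, `atHigh`, `atLow` and `isLoneLowSurrogate` are made up for the illustration.

```javascript
const smiley = "🙂"; // U+1F642, stored as the code units 0xD83D 0xDE42

// Index 0 is the high surrogate: codePointAt combines it with the next unit.
const atHigh = smiley.codePointAt(0);

// Index 1 is the low surrogate: there is nothing to combine, so the raw unit comes back.
const atLow = smiley.codePointAt(1);

// A value in 0xDC00-0xDFFF on its own is a lone low surrogate, not a real character.
const isLoneLowSurrogate = atLow >= 0xDC00 && atLow <= 0xDFFF;

console.log(atHigh, atLow, isLoneLowSurrogate);
// prints: 128578 56898 true
```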

Thus you need to skip one position whenever codePointAt lands on a high surrogate, which is when it returns a code point above 0xFFFF.
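If you want to keep the index-based loop from the question, the skip could look roughly like this. It is a minimal sketch, assuming a native codePointAt is available; `htmlEncodeByIndex` is a hypothetical name, not something from the question or from any library.

```javascript
// Sketch: index-based encoding that steps over the low surrogate of each pair.
function htmlEncodeByIndex(s) {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const code = s.codePointAt(i);
    if (code > 127) {
      out += "&#" + code + ";";
    } else {
      out += s.charAt(i);
    }
    // A code point above 0xFFFF occupied two code units, so skip the second one.
    if (code > 0xFFFF) {
      i++;
    }
  }
  return out;
}

const encodedByIndex = htmlEncodeByIndex("1:🙂__3:ß");
console.log(encodedByIndex);
// prints: 1:&#128578;__3:&#223;
```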

Following the example in the specs, instead of iterating through each 16-bit code unit in the string, use a construct that loops through each character (code point). for (let char of aString) {} does just that.

    function html_encode(s) {
        var ret_val = '';
        for (let char of s) {
            const code = char.codePointAt(0);
            if (code > 127) {
                ret_val += '&#' + code + ';';
            } else {
                ret_val += char;
            }
        }
        return ret_val;
    }

    let v = html_encode(document.getElementById('orig').innerHTML);
    document.getElementById('copy').innerHTML = v;
    document.getElementById('values').value = v;

    div {
        padding: 10px;
        border: solid 1px grey;
    }
    textarea {
        width: calc(100% - 30px);
        height: 50px;
        padding: 10px;
    }

    Orig:<div id='orig'>1:🙂__2:𝌆__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:&#9888;&#65039;__11:&#9888;__12:&#128578;</div>
    Copy:<div id='copy'></div>
    Values:<textarea id='values'></textarea>
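The same per-code-point idea can be tried without a DOM, for example under Node. The sketch below uses Array.from, which walks a string with the same iterator as for...of; `htmlEncodeCodePoints` is a hypothetical name chosen for this illustration. It also shows why items 9 and 10 in the question encode to two entities: ⚠️ is U+26A0 followed by the variation selector U+FE0F, so it is two code points even though it displays as one symbol.

```javascript
// Sketch: encode every code point above 127 as a decimal numeric entity.
function htmlEncodeCodePoints(s) {
  return Array.from(s, (char) => {
    const code = char.codePointAt(0);
    return code > 127 ? "&#" + code + ";" : char;
  }).join("");
}

const encodedSmiley = htmlEncodeCodePoints("12:🙂");
const encodedWarning = htmlEncodeCodePoints("9:\u26A0\uFE0F"); // ⚠ plus variation selector

console.log(encodedSmiley);  // prints: 12:&#128578;
console.log(encodedWarning); // prints: 9:&#9888;&#65039;
```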

