Converting Unicode characters to hex produces extra bytes

Here is the code that I use for escaping multi-byte Unicode characters.

let sample = '1F3C4-1F3FB-200D-2640-FE0F'; //🏄🏻‍♀️
let characters = String.fromCodePoint(...sample.split('-').map(code => parseInt(code, 16)));
let codes = '';
for(let i=0;i<characters.length;i++){
    codes += (i === 0 ? '' : '-') + characters.codePointAt(i).toString(16).toUpperCase();
}
console.log(codes); //1F3C4-DFC4-1F3FB-DFFB-200D-2640-FE0F

As you can see from the example, the conversion adds two extra values (DFC4 and DFFB) to the result.

Is there anything wrong with my code? How can I fix it?

Apparently, the codePointAt function gives "a number representing the code unit value of the character at the given index". However, the index counts UTF-16 code units, the same as for charCodeAt. If the index points at the first half of a surrogate pair, you get the full code point; if it points at the middle of a surrogate pair (such as \uD83C\uDFC4 for \u{1F3C4}), you only get the second half of the pair.

You can see this in your output: the extra values appear right after the two characters that need surrogate pairs (the U+1xxxx characters), and each one is the second half of the preceding surrogate pair.
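To make the surrogate-pair behavior concrete, here is a small sketch that prints what charCodeAt and codePointAt return at each index for the surfer emoji alone. The names surfer, hex and perIndex are only illustrative and are not from the question.

```javascript
// U+1F3C4 is stored as two UTF-16 code units: 0xD83C (high) and 0xDFC4 (low).
const surfer = String.fromCodePoint(0x1F3C4);

// Helper (illustrative name): format a number as uppercase hex.
const hex = n => n.toString(16).toUpperCase();

console.log(surfer.length); // 2, because length counts code units

const perIndex = [];
for (let i = 0; i < surfer.length; i++) {
  perIndex.push({
    index: i,
    charCodeAt: hex(surfer.charCodeAt(i)),
    codePointAt: hex(surfer.codePointAt(i)),
  });
}

// index 0: charCodeAt D83C, codePointAt 1F3C4 (full code point)
// index 1: charCodeAt DFC4, codePointAt DFC4 (low surrogate only)
console.log(perIndex);
```

Index 1 is where the stray DFC4 in the question's output comes from.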

If you are using ES6, you can use the spread operator to split the string into Unicode code points. Unlike .split(''), which splits on code units, spreading iterates by code point and does not break surrogate pairs apart:

const string = "\u{1F3C4}\u{1F3FB}\u{200D}\u{2640}\u{FE0F}";
console.log(string); // 🏄🏻‍♀️
const codes = [...string]
    .map(ch => ch.codePointAt(0).toString(16).toUpperCase())
    .join('-');
console.log(codes); // 1F3C4-1F3FB-200D-2640-FE0F
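If you would rather keep an index-based loop like the one in the question, one possible fix is to advance the index by two whenever the code point is above 0xFFFF, since such a code point occupies two code units. This is a minimal sketch of that alternative, not part of the approach above; the function name toCodePointHex is made up for illustration.

```javascript
// Hypothetical helper: convert a string to dash-separated hex code points.
function toCodePointHex(str) {
  const parts = [];
  for (let i = 0; i < str.length; ) {
    const cp = str.codePointAt(i);
    parts.push(cp.toString(16).toUpperCase());
    // Code points above 0xFFFF take two UTF-16 code units, so skip both.
    i += cp > 0xFFFF ? 2 : 1;
  }
  return parts.join('-');
}

// Same input as in the question.
const sample = '1F3C4-1F3FB-200D-2640-FE0F';
const characters = String.fromCodePoint(
  ...sample.split('-').map(code => parseInt(code, 16))
);

console.log(toCodePointHex(characters)); // 1F3C4-1F3FB-200D-2640-FE0F
```

A for...of loop over the string would work as well, because string iteration also goes by code point.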
