Convert Unicode characters to hex causes extra bytes

Question

Here is the code that I use for escaping multi-byte Unicode characters.

let sample = '1F3C4-1F3FB-200D-2640-FE0F'; //🏄🏻‍♀️
let characters = String.fromCodePoint(...sample.split('-').map(code => parseInt(code, 16)));
let codes = '';
for(let i=0;i<characters.length;i++){
    codes += (i === 0 ? '' : '-') + characters.codePointAt(i).toString(16).toUpperCase();
}
console.log(codes); //1F3C4-DFC4-1F3FB-DFFB-200D-2640-FE0F

As you can see from the example, the conversion produces two extra values in the result (DFC4 and DFFB).

Is there anything wrong with my code? How can I fix it?

Answer 1

Apparently, the codePointAt function gives "a number representing the code unit value of the character at the given index". However, the index counts UTF-16 code units, just as it does for charCodeAt, so if that index falls in the middle of a surrogate pair (such as \uD83C\uDFC4 for \u{1F3C4}), it will only give the second half of the surrogate pair. At the index of the first half, it gives the full code point.

You can see this in your output: the extra values appear right after the two characters that need surrogate pairs (the U+1xxxx characters), and each one is the second half of the preceding surrogate pair.
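To make the indexing behaviour concrete, here is a minimal sketch that inspects a single astral character at both of its UTF-16 indexes. The variable names are only illustrative.

```javascript
// U+1F3C4 is one code point but occupies two UTF-16 code units.
const surfer = "\u{1F3C4}";
const hex = n => n.toString(16).toUpperCase();

const unitCount = surfer.length;            // number of code units, not characters
const atFirst = hex(surfer.codePointAt(0)); // index of the high surrogate
const atSecond = hex(surfer.codePointAt(1)); // index of the low surrogate
const unitFirst = hex(surfer.charCodeAt(0)); // raw code unit at index 0
const unitSecond = hex(surfer.charCodeAt(1)); // raw code unit at index 1

console.log(unitCount);             // 2
console.log(atFirst, atSecond);     // 1F3C4 DFC4
console.log(unitFirst, unitSecond); // D83C DFC4
```

Index 0 yields the whole code point, while index 1 yields only the low surrogate, which is where the stray DFC4 in the question's output comes from.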

If you are using ES6, you can use the spread operator to split the string into Unicode characters (it iterates by code point, so it does not split surrogate pairs the way string.split('') does):

const string = "\u{1F3C4}\u{1F3FB}\u200D\u2640\uFE0F";
console.log(string); // 🏄🏻‍♀️
const codes = [...string]
    .map(ch => ch.codePointAt(0).toString(16).toUpperCase())
    .join('-');
console.log(codes); // 1F3C4-1F3FB-200D-2640-FE0F
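If you would rather keep an index-based loop like the one in the question, another option is to advance the index by two whenever the code point lies above U+FFFF. The following is only a sketch; the helper names toHexCodes and fromHexCodes are made up for illustration and do not come from the question or from any library.

```javascript
// Hypothetical helper: string -> "XXXX-XXXX" hex code points.
function toHexCodes(text) {
  const codes = [];
  for (let i = 0; i < text.length; ) {
    const codePoint = text.codePointAt(i);
    codes.push(codePoint.toString(16).toUpperCase());
    // Code points above U+FFFF take two UTF-16 code units, so skip both.
    i += codePoint > 0xFFFF ? 2 : 1;
  }
  return codes.join('-');
}

// Hypothetical helper: the inverse conversion, as in the question.
function fromHexCodes(hex) {
  return String.fromCodePoint(...hex.split('-').map(code => parseInt(code, 16)));
}

const sample = '1F3C4-1F3FB-200D-2640-FE0F';
const roundTrip = toHexCodes(fromHexCodes(sample));
console.log(roundTrip); // 1F3C4-1F3FB-200D-2640-FE0F
```

The spread version is shorter. This one may be handy if you also need the code-unit index of each character.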

solution1
1 ACCEPTED 2016-09-29 17:39:54
