NodeJS UTF8 Encoding A Buffer Then Decoding That UTF8 String Produces A Buffer With Different Content

Question

I typed this into the Node.js console:

new Buffer(new Buffer([0xde]).toString('utf8'), 'utf8')

and it prints out

<Buffer ef bf bd>

After looking at the docs, I expected this to produce an identical buffer. I'm creating a UTF-8 string from a buffer whose contents consist of one byte (0xde), then using that string to create a new buffer. Am I missing something here?

Answer 1

For encodings that can be multi-byte, you cannot expect to get the same bytes back that you started with in all cases, because not every byte sequence is valid in such an encoding. In the case of UTF-8, some characters require more than one byte to be represented properly.
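To make the multi-byte point concrete, here is a minimal sketch that counts how many bytes a few characters occupy in UTF-8. It assumes a Node.js version that supports Buffer.byteLength and the \u{...} escape syntax; the sample characters and variable names are only illustrative.

```javascript
// Sample characters, written as escapes so the file encoding does not matter.
const samples = {
  ascii: 'a',            // U+0061
  eAcute: '\u00e9',      // U+00E9
  euro: '\u20ac',        // U+20AC
  emoji: '\u{1F600}',    // U+1F600
};

// Buffer.byteLength reports the encoded size without allocating a Buffer.
const sizes = {};
for (const [name, ch] of Object.entries(samples)) {
  sizes[name] = Buffer.byteLength(ch, 'utf8');
}

console.log(sizes); // { ascii: 1, eAcute: 2, euro: 3, emoji: 4 }
```

Only code points up to U+007F fit in one byte; everything else is spread over two to four bytes, each with a fixed bit pattern.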

In your example, 0xde exceeds 0x7f, which is the largest value that UTF-8 can represent in a single byte. A byte of 0xde (binary 11011110) is the lead byte of a two-byte sequence and must be followed by a continuation byte. So when you call .toString('utf8'), node sees only that one byte, cannot decode it, and instead returns the replacement character U+FFFD (0xef, 0xbf, 0xbd when encoded as UTF-8), which is used to denote an unknown/unrepresentable character. Reading this replacement character back into a new Buffer is no problem, as it is a valid UTF-8 character, so you get those three bytes rather than the original one.
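The behavior described above can be reproduced with a short sketch. It uses Buffer.from rather than the new Buffer constructor from the question, which is deprecated in current Node.js releases. The 'latin1' and 'hex' round trips are suggestions for cases where arbitrary bytes must survive a trip through a string; they are not part of the original answer. The 'latin1' encoding name assumes a reasonably recent Node.js version.

```javascript
// 0xde is 0b11011110: a UTF-8 lead byte that promises one continuation byte.
const original = Buffer.from([0xde]);

// Decoding as UTF-8 replaces the incomplete sequence with U+FFFD.
const asUtf8 = original.toString('utf8');
const utf8RoundTrip = Buffer.from(asUtf8, 'utf8');

// 'latin1' maps every byte 0x00-0xff to exactly one code point, so it round-trips.
const latin1RoundTrip = Buffer.from(original.toString('latin1'), 'latin1');

// 'hex' is a text-safe alternative that also preserves the bytes.
const hexRoundTrip = Buffer.from(original.toString('hex'), 'hex');

// A complete two-byte sequence (0xde 0x80 encodes U+0780) survives UTF-8 intact.
const valid = Buffer.from([0xde, 0x80]);
const validRoundTrip = Buffer.from(valid.toString('utf8'), 'utf8');

console.log(utf8RoundTrip);   // <Buffer ef bf bd>
console.log(latin1RoundTrip); // <Buffer de>
console.log(hexRoundTrip);    // <Buffer de>
console.log(validRoundTrip);  // <Buffer de 80>
```

In short, a UTF-8 round trip only preserves bytes that were valid UTF-8 to begin with. For arbitrary binary data, keep it in a Buffer or use an encoding such as hex or base64.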

1 answers

solution1
1 2015-02-11 18:50:31
