
NodeJS UTF8 Encoding A Buffer Then Decoding That UTF8 String Produces A Buffer With Different Content

I typed this into the nodejs console

new Buffer(new Buffer([0xde]).toString('utf8'), 'utf8')

and it prints out

<Buffer ef bf bd>

After looking at the docs, it seems that this should produce an identical buffer. I'm creating a UTF-8 encoded string from a buffer whose contents consist of one byte (0xde), then using that string to create a new buffer. Am I missing something here?

For encodings that can be multi-byte, you cannot expect to get the same bytes back that you started with in all cases. In UTF-8, some characters require more than one byte to be represented properly.
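As a small illustration of the multi-byte point, the sketch below encodes three characters of my own choosing (they are not from the question) and prints the resulting bytes. It uses Buffer.from, which replaces the deprecated new Buffer(...) constructor in current Node.js versions.

```javascript
// Characters picked only for illustration; escapes avoid source-encoding issues.
const ascii = Buffer.from('a', 'utf8');         // U+0061, fits in one byte
const accented = Buffer.from('\u00e9', 'utf8'); // U+00E9 (e with acute), two bytes
const euro = Buffer.from('\u20ac', 'utf8');     // U+20AC (euro sign), three bytes

console.log(ascii);    // <Buffer 61>
console.log(accented); // <Buffer c3 a9>
console.log(euro);     // <Buffer e2 82 ac>
```

Only code points up to U+007F encode to a single byte; everything above that becomes a lead byte followed by one or more continuation bytes.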

In your example, 0xde exceeds 0x7f, which is the largest value that UTF-8 can represent with a single byte. In UTF-8, 0xde is the lead byte of a two-byte sequence and must be followed by a continuation byte. So when you call .toString('utf8'), node sees an incomplete sequence and instead returns the replacement character U+FFFD (0xef, 0xbf, 0xbd when encoded as UTF-8), which is used to denote an unknown/unrepresentable character. Reading this "replacement character" back into a new Buffer is then no problem, as it is a valid UTF-8 character.
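The behaviour can be reproduced with the sketch below. It also shows two ways to get the original byte back, assuming the goal is simply to carry arbitrary bytes through a string: the 'hex' and 'latin1' encodings map every byte value to a string form and back. Variable names are my own, and Buffer.from stands in for the deprecated new Buffer(...) used in the question.

```javascript
const original = Buffer.from([0xde]);

// 0xde alone is an incomplete UTF-8 sequence, so decoding yields U+FFFD.
const decoded = original.toString('utf8');
const reencoded = Buffer.from(decoded, 'utf8');

console.log(decoded === '\ufffd'); // true
console.log(reencoded);            // <Buffer ef bf bd>

// Byte-preserving alternatives: every byte value survives the round trip.
const viaHex = Buffer.from(original.toString('hex'), 'hex');
const viaLatin1 = Buffer.from(original.toString('latin1'), 'latin1');

console.log(viaHex.equals(original));    // true
console.log(viaLatin1.equals(original)); // true
```

If the bytes are binary data rather than text, 'hex' or 'base64' is usually the safer choice for a string representation.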
