TCP receiving extended ASCII or UTF-8 characters

Original post: 2011-02-08 12:06:51, tags: c, tcp, winsock

For the inverted question mark ¿ I receive two bytes, [-62] [-65], but how do I get a readable UTF-8 or ASCII character encoding from them?

3 answers

That is the UTF-8 encoding of that character. The inverted question mark is Unicode code point 191 (U+00BF) which, in UTF-8, is 0xc2:0xbf.

You're seeing them as signed bytes. For example, -62 signed is 256-62, or 194 unsigned, which is hex 0xc2.

Similarly, -65 signed is 256-65, or 191 unsigned, which is hex 0xbf.
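The signed-to-unsigned step can be sketched in C as follows. This is only an illustration: `byte_value` and `dump_hex` are hypothetical helper names, and the buffer is assumed to be the plain `char` array that `recv` filled in.

```c
#include <stdio.h>

/* Hypothetical helper: reinterpret a (possibly signed) char as its
   unsigned byte value in the range 0..255. */
static unsigned int byte_value(char c)
{
    return (unsigned char)c;
}

/* Hypothetical helper: print a received buffer as unsigned hex bytes. */
static void dump_hex(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        printf("0x%02x ", byte_value(buf[i]));
    printf("\n");
}
```

Calling `dump_hex` on a two-byte buffer holding -62 and -65 should show the bytes as 0xc2 and 0xbf. Declaring the receive buffer as `unsigned char` in the first place avoids the negative values altogether.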

If you want to convert your UTF-8 sequence into a code point, you can use the table below.

    Range              Encoding  Binary value
    -----------------  --------  --------------------------
    U+000000-U+00007f  0xxxxxxx  0xxxxxxx

    U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                       10xxxxxx

    U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                       10yyyyxx
                       10xxxxxx

    U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                       10zzyyyy
                       10yyyyxx
                       10xxxxxx

For example, your 0xc2:0xbf is binary 11000010 10111111, which matches the second case:

      11000010 10111111
         |||||   ||||||
         |||\\  //////
         ||| ||||||||
    00000000 10111111  ->  0x00bf  ->  191
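The table lookup above can be turned into a small decoder. The following is a minimal sketch, not a reference implementation: `utf8_decode` is a hypothetical name, and the checks for overlong forms and surrogates are additions that the table itself does not spell out.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from buf (len bytes available).
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the sequence is truncated or malformed. */
static size_t utf8_decode(const unsigned char *buf, size_t len, uint32_t *cp)
{
    /* smallest code point that legitimately needs n bytes */
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n;
    uint32_t v;

    if (len == 0) return 0;
    if (buf[0] < 0x80)                { n = 1; v = buf[0]; }         /* 0xxxxxxx */
    else if ((buf[0] & 0xe0) == 0xc0) { n = 2; v = buf[0] & 0x1f; }  /* 110yyyxx */
    else if ((buf[0] & 0xf0) == 0xe0) { n = 3; v = buf[0] & 0x0f; }  /* 1110yyyy */
    else if ((buf[0] & 0xf8) == 0xf0) { n = 4; v = buf[0] & 0x07; }  /* 11110zzz */
    else return 0;                    /* stray continuation or invalid lead byte */

    if (len < n) return 0;            /* sequence is truncated */
    for (size_t i = 1; i < n; i++) {
        if ((buf[i] & 0xc0) != 0x80) return 0;   /* continuation must be 10xxxxxx */
        v = (v << 6) | (buf[i] & 0x3f);          /* append six payload bits */
    }
    /* reject overlong forms, surrogates, and values past U+10FFFF */
    if (v < min_cp[n] || v > 0x10ffff || (v >= 0xd800 && v <= 0xdfff)) return 0;
    *cp = v;
    return n;
}
```

For the bytes 0xc2 0xbf this yields code point 191 and consumes 2 bytes. Since TCP is a byte stream, a multi-byte sequence may be split across two `recv` calls; a return value of 0 for a short buffer is one way to signal that more bytes should be read before retrying.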

Those 2 bytes are probably UTF-8.

Plain ASCII is 7-bit and has no ¿; for an "extended ASCII" representation you would need to choose a specific code page.

And what exactly is a 'readable' char encoding?
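To make the code page remark concrete, here is a hedged sketch that assumes the target is ISO-8859-1 (Latin-1), whose 256 byte values coincide with the first 256 Unicode code points. The choice of code page and the name `to_latin1` are assumptions for illustration; other code pages (for example Windows-1252) need their own mapping.

```c
#include <stdint.h>

/* Map a Unicode code point to an ISO-8859-1 byte.
   Code points above 0xff have no Latin-1 equivalent and become '?'. */
static unsigned char to_latin1(uint32_t cp)
{
    return cp <= 0xff ? (unsigned char)cp : (unsigned char)'?';
}
```

Under that assumption, code point 191 becomes the single byte 0xbf, which a Latin-1 terminal displays as ¿.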

Look at the byte values in hexadecimal:

-62 is 0xc2
-65 is 0xbf

If you look up the Unicode information for the glyph in question, you can see that these are, indeed, the two bytes that make up the UTF-8 encoding of the inverted question mark glyph.
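One way to check this without a lookup table is to encode the code point yourself and compare the bytes. The sketch below is illustrative only; `utf8_encode` is a hypothetical name, not part of Winsock or the C standard library.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode cp as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is not encodable. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                              /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                             /* 2 bytes */
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp < 0x10000) {                           /* 3 bytes */
        if (cp >= 0xd800 && cp <= 0xdfff) return 0;   /* surrogates are not characters */
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    }
    if (cp <= 0x10ffff) {                         /* 4 bytes */
        out[0] = (unsigned char)(0xf0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (cp & 0x3f));
        return 4;
    }
    return 0;                                     /* beyond U+10FFFF */
}
```

Encoding 0xbf (the inverted question mark) produces 0xc2 followed by 0xbf, matching the two received bytes.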