
Why are ASCII values of a byte different when cast as Int32?

I'm creating a program that will scrub extended ASCII characters from text documents. I'm trying to understand how C# interprets the different character sets and codes, and I'm noticing some oddities.

Consider:

using System;
using System.Text;

namespace ASCIITest
{
    class Program
    {
        static void Main(string[] args)
        {
            string value = "Slide™1½”C4®";
            byte[] asciiValue = Encoding.ASCII.GetBytes(value);   // byte array
            char[] array = value.ToCharArray();                   // char array
            Console.WriteLine("CHAR\tBYTE\tINT32"); 
            for (int i = 0; i < array.Length; i++)
            {
                char  letter     = array[i];
                byte  byteValue  = asciiValue[i];
                // Implicit char-to-int conversion yields the character's UTF-16 code unit.
                Int32 int32Value = array[i];
                Console.WriteLine("{0}\t{1}\t{2}", letter, byteValue, int32Value);
            }
            Console.ReadLine();
        }
    }
}

Output from program:

CHAR    BYTE    INT32
S       83      83
l       108     108
i       105     105
d       100     100
e       101     101
T       63      8482      <- trademark symbol
1       49      49
½       63      189       <- fraction
"       63      8221      <- smartquotes
C       67      67
4       52      52
r       63      174       <- registered trademark symbol

In particular, I'm trying to understand why the extended ASCII characters (the ones with my notes added to the right of the third column) show the correct value when cast as Int32, but all show up as 63 when converted to a byte value. What's going on here?

Encoding.ASCII.GetBytes replaces every character outside the ASCII range (0-127) with a question mark (code 63).

Since your string contains characters outside that range, asciiValue holds ? in place of all the interesting symbols such as ™, whose char (Unicode) value is 8482, which is indeed outside the 0-127 range.

Converting the string to a char array does not modify the character values, so you still have the original Unicode codes (char is essentially a 16-bit unsigned integer holding a UTF-16 code unit). Casting it to the wider integer type Int32 does not change the value.
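Because the char values survive intact, a scrubber can test them against the ASCII range directly instead of going through Encoding.ASCII. The following is only a minimal sketch under that assumption; the class name AsciiScrubber and the helper RemoveNonAscii are hypothetical names chosen for illustration, not part of the original program.

```csharp
using System;
using System.Text;

static class AsciiScrubber
{
    // Hypothetical helper: keeps only characters whose code is in the ASCII range (0-127).
    public static string RemoveNonAscii(string input)
    {
        var builder = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c <= 127)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }

    static void Main()
    {
        Console.WriteLine(RemoveNonAscii("Slide™1½”C4®")); // prints "Slide1C4"
    }
}
```

Whether dropping the characters or substituting something else is the better choice depends on the documents being cleaned.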

Below are the possible conversions of the ™ character into bytes/integers:

var value = "™";
var ascii = Encoding.ASCII.GetBytes(value)[0]; // 63 ('?'): ™ is outside the 0-127 range
var castToByte = (byte)value[0];               // 34 = 8482 % 256 (the high byte is discarded)
var int16Value = (Int16)value[0];              // 8482
var int32Value = (Int32)value[0];              // 8482
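For comparison, here is a small sketch of what the same ™ character (U+2122) looks like under encodings that can represent it. The class name EncodingComparison and the helper Describe are assumptions made for this example; the Encoding properties used are standard .NET ones.

```csharp
using System;
using System.Text;

static class EncodingComparison
{
    // Hypothetical helper: renders a byte array as space-separated decimal values.
    public static string Describe(byte[] bytes)
    {
        return string.Join(" ", bytes);
    }

    static void Main()
    {
        string tm = "\u2122"; // the trademark symbol

        Console.WriteLine(Describe(Encoding.ASCII.GetBytes(tm)));   // 63 (replaced by '?')
        Console.WriteLine(Describe(Encoding.UTF8.GetBytes(tm)));    // 226 132 162
        Console.WriteLine(Describe(Encoding.Unicode.GetBytes(tm))); // 34 33 (UTF-16, little-endian)
    }
}
```

The low byte of the UTF-16 form, 34, is the same value the (byte) cast produces, since 8482 is 0x2122.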

Details are available in the ASCIIEncoding Class documentation:

ASCIIEncoding corresponds to the Windows code page 20127. Because ASCII is a 7-bit encoding, ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F. If you use the default encoder returned by the Encoding.ASCII property or the ASCIIEncoding constructor, characters outside that range are replaced with a question mark (?) before the encoding operation is performed.
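The replacement behaviour described in that passage belongs to the default encoder. As a hedged sketch, an ASCII encoding obtained with an exception fallback can be used to detect out-of-range characters instead of silently turning them into question marks. The class name StrictAscii and the helper IsPureAscii are hypothetical; Encoding.GetEncoding with fallback arguments and EncoderFallback.ExceptionFallback are existing .NET APIs.

```csharp
using System;
using System.Text;

static class StrictAscii
{
    // Hypothetical helper: returns true only if every character can be encoded as ASCII.
    public static bool IsPureAscii(string input)
    {
        // Same "us-ascii" encoding, but it throws instead of substituting '?'.
        Encoding strict = Encoding.GetEncoding(
            "us-ascii",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(input);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsPureAscii("Slide1C4")); // True
        Console.WriteLine(IsPureAscii("Slide™1"));  // False
    }
}
```

Using exceptions for a yes/no check is a simplification for illustration; comparing each char against 127 is cheaper when only detection is needed.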
