Why are ASCII values of a byte different when cast as Int32?
I'm in the process of creating a program that will scrub extended ASCII characters from text documents. I'm trying to understand how C# interprets the different character sets and codes, and I'm noticing some oddities.
Consider:
using System;
using System.Text;

namespace ASCIITest
{
    class Program
    {
        static void Main(string[] args)
        {
            string value = "Slide™1½”C4®";
            byte[] asciiValue = Encoding.ASCII.GetBytes(value); // one byte per char, '?' for non-ASCII
            char[] array = value.ToCharArray();                 // original UTF-16 code units
            Console.WriteLine("CHAR\tBYTE\tINT32");
            for (int i = 0; i < array.Length; i++)
            {
                char letter = array[i];
                byte byteValue = asciiValue[i];
                Int32 int32Value = array[i]; // implicit widening of char to Int32

                Console.WriteLine("{0}\t{1}\t{2}", letter, byteValue, int32Value);
            }
            Console.ReadLine();
        }
    }
}
Output from program:
CHAR    BYTE    INT32
S       83      83
l       108     108
i       105     105
d       100     100
e       101     101
™       63      8482    <- trademark symbol
1       49      49
½       63      189     <- fraction
”       63      8221    <- smart quote
C       67      67
4       52      52
®       63      174     <- registered trademark symbol
In particular, I'm trying to understand why the extended ASCII characters (the ones with my notes added to the right of the third column) show up with the correct value when cast as Int32, but all show up as 63 when converted to the byte value. What's going on here?
The Encoding.ASCII.GetBytes conversion replaces every character outside the ASCII range (0-127) with a question mark (code 63).
Since your string contains characters outside that range, your asciiValue array holds ? in place of each of the interesting symbols. Take ™ as an example: its char (Unicode) value is 8482, which is indeed outside the 0-127 range.
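Since the stated goal is to scrub these characters, here is a minimal sketch of the range check described above, applied one character at a time. The class and method names (AsciiScrubber, IsAscii, Scrub) are hypothetical and chosen only for illustration. It assumes that dropping the character outright is acceptable, as opposed to replacing it with something else.

```csharp
using System;
using System.Linq;

public static class AsciiScrubber
{
    // True when the character is inside the 7-bit ASCII range (U+0000 to U+007F).
    public static bool IsAscii(char c)
    {
        return c <= 127;
    }

    // Returns a copy of the input with every non-ASCII char dropped.
    // Characters outside the Basic Multilingual Plane are stored as two
    // surrogate chars, both above 127, so both halves are dropped as well.
    public static string Scrub(string text)
    {
        return new string(text.Where(c => IsAscii(c)).ToArray());
    }
}
```

With the string from the question, AsciiScrubber.Scrub("Slide™1½”C4®") should yield "Slide1C4".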
Converting a string to a char array does not modify the character values, so you still have the original Unicode codes. A char is essentially a 16-bit unsigned integer (a UTF-16 code unit), so casting it to the wider Int32 type does not change the value.
Below are possible conversions of that character into bytes and integers:
var value = "™";
var ascii = Encoding.ASCII.GetBytes(value)[0]; // 63 ('?'): the character is outside the 0-127 range
var castToByte = (byte)(value[0]);             // 34 = 8482 % 256 (only the low 8 bits are kept)
var asInt16 = (Int16)value[0];                 // 8482
var asInt32 = (Int32)value[0];                 // 8482
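The difference between a truncating cast and an encoding conversion can be made explicit with C#'s checked and unchecked operators. The following is only a sketch; NarrowingDemo, Truncate, and TryNarrow are hypothetical names introduced for illustration. Note that a value fitting in a byte (such as 189 for ½) is still outside the ASCII range, so a successful narrowing does not imply an ASCII character.

```csharp
using System;

public static class NarrowingDemo
{
    // Keeps only the low 8 bits of the UTF-16 code unit (value modulo 256).
    public static byte Truncate(char c)
    {
        return unchecked((byte)c);
    }

    // Returns null instead of silently truncating when the value exceeds 255.
    public static byte? TryNarrow(char c)
    {
        try
        {
            return checked((byte)c);
        }
        catch (OverflowException)
        {
            return null;
        }
    }
}
```

For ™ (8482, or 0x2122), Truncate keeps the low byte 0x22, which is 34, while TryNarrow reports that the value does not fit.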
Details are available in the ASCIIEncoding Class documentation:
ASCIIEncoding corresponds to the Windows code page 20127. Because ASCII is a 7-bit encoding, ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F. If you use the default encoder returned by the Encoding.ASCII property or the ASCIIEncoding constructor, characters outside that range are replaced with a question mark (?) before the encoding operation is performed.
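The question-mark substitution is the behavior of the default encoder only. As a sketch, an ASCII encoding with a different fallback can be requested through Encoding.GetEncoding. The wrapper class AsciiFallbacks and its member names are hypothetical; the fallback types come from System.Text. Using an empty replacement string to drop characters is one possible choice for scrubbing, not something the quoted documentation prescribes.

```csharp
using System;
using System.Text;

public static class AsciiFallbacks
{
    // ASCII encoding that throws on non-ASCII input instead of substituting '?'.
    public static readonly Encoding Strict = Encoding.GetEncoding(
        "us-ascii",
        new EncoderExceptionFallback(),
        new DecoderExceptionFallback());

    // ASCII encoding whose replacement string is empty, so non-ASCII chars vanish.
    public static readonly Encoding Dropping = Encoding.GetEncoding(
        "us-ascii",
        new EncoderReplacementFallback(string.Empty),
        new DecoderReplacementFallback(string.Empty));

    // True when every character of the text can be encoded as ASCII.
    public static bool IsPureAscii(string text)
    {
        try
        {
            Strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    // Round-trips the text through the dropping encoder to remove non-ASCII chars.
    public static string Scrub(string text)
    {
        return Dropping.GetString(Dropping.GetBytes(text));
    }
}
```

The strict variant is useful for detecting problem documents before changing them, while the dropping variant performs the scrub in one pass.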