C Language: Why can an int variable store a char?

I have recently been reading The C Programming Language by Kernighan and Ritchie.

There is an example that declares a variable as int but stores the result of getchar() in it.

int x;
x = getchar();

Why can we store char data in an int variable? The only thing I can think of is ASCII and Unicode. Am I right?

The getchar function (and similar character input functions) returns an int because of EOF. There are cases where (char) EOF != EOF (for example, when char is an unsigned type).

Also, in many places where one uses a char variable, it is silently promoted to int anyway. And character constants like 'A' already have type int in C.
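The promotion point can be checked with sizeof. This is only a small sketch; the two helper names are made up for illustration and appear in no library.

```c
#include <stddef.h>

/* Hypothetical helper: size of a character constant.
   In C (unlike C++), 'A' has type int, so this equals sizeof(int). */
size_t char_constant_size(void)
{
    return sizeof 'A';
}

/* Hypothetical helper: size of the expression c + 1.
   c is promoted before the addition, so the result is int-sized,
   even though c itself occupies a single byte. */
size_t promoted_sum_size(char c)
{
    return sizeof(c + 1);
}
```

Both functions should return sizeof(int) when compiled as C, while sizeof(char) is always 1.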

getchar is an old C standard function, and the philosophy back then was closer to how the language gets translated to assembly than to type correctness and readability. Keep in mind that compilers did not optimize code as much as they do today. In early C, int was the default return type (i.e. if a function had no declaration, the compiler assumed it returned int; C99 removed this rule), and a value is returned in a register, so returning a char instead of an int can generate additional implicit code to mask out the extra bytes of the value. Thus, many old C functions prefer to return int.

C requires int to have at least as many bits as char. Therefore, int can store the same values as char (allowing for signed/unsigned differences). In most cases, int is a lot larger than char.

char is an integer type that is intended to store a character code from the implementation-defined character set, which is required to be compatible with C's abstract basic character set. (ASCII qualifies, and so do the source-charset and execution-charset allowed by your compiler, including the one you are actually using.)

For the sizes and ranges of the integer types (char included), see your <limits.h>.
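To see what your own implementation uses, a sketch like the following prints the relevant macros. The macro names come from <limits.h>; the function names are only illustrative.

```c
#include <limits.h>
#include <stdio.h>

/* Print the ranges of char and int on this implementation. */
void print_char_and_int_limits(void)
{
    printf("CHAR_BIT  = %d\n", CHAR_BIT);
    printf("char      : %d .. %d\n", CHAR_MIN, CHAR_MAX);
    printf("UCHAR_MAX = %u\n", (unsigned)UCHAR_MAX);
    printf("int       : %d .. %d\n", INT_MIN, INT_MAX);
}

/* Returns 1 if every char value fits in an int. This holds on typical
   systems where int is wider than char; it is an assumption of this
   sketch, not something checked for exotic platforms. */
int int_covers_char(void)
{
    return INT_MIN <= CHAR_MIN && CHAR_MAX <= INT_MAX;
}
```

The printed numbers depend on the platform, so none are shown here.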

getchar() attempts to read a byte from the standard input stream. The return value can be any possible value of the type unsigned char (from 0 to UCHAR_MAX), or the special value EOF, which is specified to be negative.

On most current systems, UCHAR_MAX is 255 because bytes have 8 bits, and EOF is defined as -1, but the C Standard does not guarantee this: some systems have larger unsigned char types (9 bits, 16 bits...) and it is possible, although I have never seen it, for EOF to be defined as another negative value.

Storing the return value of getchar() (or getc(fp)) in a char would prevent proper detection of end of file. Consider these cases (on common systems):

  • If char is an 8-bit signed type, a byte value of 255, which is the character ÿ in the ISO8859-1 character set, has the value -1 when converted to a char. Comparing this char to EOF will yield a false positive.

  • If char is unsigned, converting EOF to char will produce the value 255, which is different from EOF, preventing the detection of end of file.
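Both cases can be reproduced in a few lines. The sketch below uses signed char and unsigned char explicitly, so that it behaves the same whichever signedness plain char has. It assumes 8-bit bytes, two's complement and EOF == -1, as on common systems; the function names are hypothetical.

```c
#include <stdio.h>   /* EOF */

/* Case 1: a real byte 0xFF stored in a signed 8-bit char.
   The conversion is implementation-defined; on common systems it
   yields -1, which then compares equal to EOF when EOF is -1. */
int byte_ff_looks_like_eof(void)
{
    signed char c = (signed char)0xFF;
    return c == EOF;
}

/* Case 2: EOF stored in an unsigned char.
   The value wraps to UCHAR_MAX (255 with 8-bit bytes) and is promoted
   back to a non-negative int, so it no longer equals the negative EOF. */
int eof_survives_unsigned_char(void)
{
    unsigned char c = (unsigned char)EOF;
    return c == EOF;
}
```

Under those assumptions the first function returns 1 (the false positive) and the second returns 0 (end of file is never detected).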

These are the reasons for storing the return value of getchar() into an int variable. This value can later be converted to a char, once the test for end of file has failed.
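In practice this gives the usual read loop. The sketch below uses getc(fp) rather than getchar() so that it works on any stream; count_bytes is an illustrative name, not a standard function.

```c
#include <stdio.h>

/* Count the bytes in a stream. The result of getc() is kept in an int
   so that EOF stays distinguishable from every valid byte value. */
long count_bytes(FILE *fp)
{
    long n = 0;
    int c;                          /* int, not char */

    while ((c = getc(fp)) != EOF) {
        char ch = (char)c;          /* convert only after the EOF test */
        (void)ch;                   /* a real program would use ch here */
        n++;
    }
    return n;
}
```

Calling count_bytes(stdin) counts everything on standard input, including any 0xFF bytes, which a char-typed loop variable could mistake for end of file.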

Storing an int to a char has implementation-defined behavior if the char type is signed and the value of the int is outside the range of the char type. This is a technical problem, which should have mandated the char type to be unsigned, but the C Standard allowed for many existing implementations where the char type was signed. It would take a vicious implementation to have unexpected behavior for this simple conversion.

The value of the char does indeed depend on the execution character set. Most current systems use ASCII or some extension of ASCII such as ISO8859-x, UTF-8, etc. But the C Standard supports other character sets such as EBCDIC, where the lowercase letters do not form a contiguous range.

C was designed as a very low-level language, so it is close to the hardware. Usually, after a bit of experience, you can predict how the compiler will allocate memory, and even pretty accurately what the machine code will look like.

Your intuition is right: it goes back to ASCII. ASCII is really a simple 1:1 mapping from letters (which make sense in human language) to integer values (which can be worked with by hardware); for every letter there is a unique integer. For example, the 'letter' CTRL-A is represented by the decimal number 1. (For historical reasons, lots of control characters came first, so CTRL-G, which rang the bell on an old teletype terminal, is ASCII code 7. Upper-case 'A' and the 25 remaining upper-case letters start at 65, and so on. See http://www.asciitable.com/ for a full list.)

C lets you 'coerce' variables into other types. In other words, the compiler cares about (1) the size, in memory, of the variable (see 'pointer arithmetic' in K&R), and (2) what operations you can do on it.
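The mapping is easy to inspect from C, since a char already holds the integer code. A minimal sketch, assuming an ASCII execution character set; char_code is a made-up helper name.

```c
/* Return the integer code of a character in the execution character set.
   Going through unsigned char keeps the result non-negative. */
int char_code(char c)
{
    return (unsigned char)c;
}
```

On an ASCII system, char_code('A') is 65 and char_code('\a') (the bell, CTRL-G) is 7.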

You can do arithmetic on a char directly: the value is promoted to int before the operation. So, to convert a lowercase ASCII letter to uppercase, you can do something like:

char letter;
....
if (letter >= 'a' && letter <= 'z') {   /* lowercase; assumes ASCII */
    letter = letter - 32;               /* 'a' - 'A' is 32 in ASCII */
}

No cast to int is needed before adding or subtracting. At most, a compiler with extra warnings enabled (for example GCC's -Wconversion) may warn that assigning the int result back to a char can change the value.
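For comparison, here is the conversion above written as a function, next to a variant that does not depend on ASCII. The function names are hypothetical; toupper() is the standard <ctype.h> function.

```c
#include <ctype.h>

/* ASCII-only version: relies on 'a'..'z' being contiguous and sitting
   a fixed distance above 'A'..'Z'. */
char ascii_to_upper(char letter)
{
    if (letter >= 'a' && letter <= 'z') {
        /* letter is promoted to int for the subtraction, then narrowed */
        letter = (char)(letter - ('a' - 'A'));
    }
    return letter;
}

/* Portable version: toupper() works for any execution character set.
   Its argument must be representable as unsigned char (or be EOF),
   hence the cast. */
char portable_to_upper(char letter)
{
    return (char)toupper((unsigned char)letter);
}
```

The second form is usually the safer choice, since it also behaves correctly on character sets such as EBCDIC where the letters are not contiguous.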

In the end, char is really just a small integer type: each character is stored as the integer code that the character set (ASCII on most systems) assigns to it.
