
About conversion of char to int in C with sign extension

Posted 2012-11-10 00:35:21. Tags: c, types, integer, char, type-conversion

I was reading the book Programming in C by Stephen G. Kochan. It states that:

"if a character value is used that is not part of the standard character, its sign might be extended when converted to an integer"

And then it states:

"C language permits character variable to be declared unsigned, thus avoiding this potential problem"

Can someone explain what problem may occur when the sign is extended during conversion from char to int? Why does this matter? And what's wrong with a negative integer that was converted from a char?

Thank you.

2 answers

Let's say you take an innocent-looking function from <ctype.h>, isupper().

It's declared as int isupper(int c); so it takes an int and returns an int.

Now, let's say that you're not a very careful programmer, and you just pass your char to this function. You think to yourself: "What could go wrong? This is the simplest function I know!"

But you'd be wrong. Somewhere, someone will have her MP3 player going into an endless crash-loop because of this terrible mistake.

And here's why. The most annoying type in C is char. It can be signed, it can be unsigned, you can force the compiler one way or another (but then you open another can of worms), and worst of all, the standard C library uses this type everywhere!

So, you use char, but you're not aware that it's actually signed in your environment. You use it as if the world were an ASCII world.

But the world isn't. And that happy MP3 owner is now listening to a famous German song whose name contains the letter ä ("extended ASCII code 132").

You pass this character to isupper(), and the compiler does the following horror: "Ah, it's a character, but the function takes an integer. I know! I will not warn the programmer, because that's too simple. I'll just convert the character to an integer and pass it along. How do I do that? Let's check the C standard... Hmmm... Simple, just take the value and sign-extend it (because char is signed, don't you know?). Now, this character has the value -124, so I'll just convert it to an int with the value -124. That was simple, I don't see what the fuss is about. Why should I even warn the programmer?!"

And now isupper() is called with -124 instead of 132.
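The arithmetic can be checked with a small sketch. The helper names as_signed_int and as_unsigned_int are made up for illustration, and the numbers assume 8-bit chars (CHAR_BIT == 8). Using signed char explicitly reproduces what happens to a plain char on a platform where char is signed.

```c
/* Hypothetical helpers, assuming CHAR_BIT == 8. */

/* Widening a signed char to int preserves the (possibly negative) value:
   this is the sign extension described above. */
int as_signed_int(signed char c)
{
    return c;
}

/* Converting to unsigned char first maps the same byte into 0..255,
   so the resulting int is never negative. */
int as_unsigned_int(signed char c)
{
    return (unsigned char)c;
}
```

Calling as_signed_int(-124) gives -124, while as_unsigned_int(-124) gives 132, the byte value the programmer had in mind.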

But what's wrong with that? Nothing, except that the C library that comes with the compiler implements isupper() using a simple 128-byte array: it simply returns the value at the given index. The array is initialised with 0 everywhere except for upper-case ASCII codes, where it's 1. Such a simple and elegant implementation...
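A table-driven implementation like that might look roughly like the sketch below. This is not taken from any real C library: upper_table, init_upper_table, naive_isupper and index_in_table are invented names, and the code assumes an ASCII character set where 'A'..'Z' are contiguous.

```c
/* Hypothetical 128-entry table: 1 for 'A'..'Z', 0 everywhere else. */
unsigned char upper_table[128];

/* Fills the table; must be called once before naive_isupper(). */
void init_upper_table(void)
{
    for (int c = 'A'; c <= 'Z'; c++)
        upper_table[c] = 1;
}

/* Naive lookup with no range check: only valid for 0 <= c < 128.
   Any other argument reads outside the array (undefined behaviour). */
int naive_isupper(int c)
{
    return upper_table[c];
}

/* Reports whether an argument would index inside the table. */
int index_in_table(int c)
{
    return c >= 0 && c < 128;
}
```

Here index_in_table(-124) is 0, so naive_isupper(-124) would read memory before the start of the array. Note that with only 128 entries, 132 would be out of range as well; a table meant to accept every unsigned char value would need more entries.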

But wait, what happens if you pass a negative value to this function? Well, that's not allowed:

The c argument is an int, the value of which the application shall ensure is a character representable as an unsigned char or equal to the value of the macro EOF. If the argument has any other value, the behavior is undefined.

So, undefined behaviour. In this case, it tries to access memory that doesn't belong to the process, and BAM! The program crashes.
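The usual way to stay within that rule is to convert the char to unsigned char before it reaches a <ctype.h> function. A minimal sketch, where is_upper_char and count_upper are hypothetical wrapper names and only isupper() itself comes from the standard library:

```c
#include <ctype.h>
#include <stddef.h>

/* Converts to unsigned char first, so isupper() always receives a value
   in 0..UCHAR_MAX regardless of whether plain char is signed. */
int is_upper_char(char c)
{
    return isupper((unsigned char)c) != 0;
}

/* Counts upper-case characters in a NUL-terminated string,
   staying well-defined even if the string contains bytes above 127. */
size_t count_upper(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (is_upper_char(*s))
            n++;
    }
    return n;
}
```

With this wrapper, a song title containing the byte 132 is classified (as not upper-case in the default "C" locale) instead of triggering undefined behaviour.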

So you see, char is evil and you should never use it, unless you really understand how to use it properly.

(*) As Keith Thompson said in the comments, it is of course impossible to avoid using char. From strlen() to curl_easy_escape(), everybody uses char. But you should be aware of conversions to int, especially when a char may hold a negative number. <ctype.h> functions and array indices are two places where it's easy to make costly mistakes.

In C, plain char can be either signed or unsigned and the choice is left to the implementation.

From C99, 6.2.5, paragraph 15:

The three types char, signed char, and unsigned char are collectively called the character types. The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char.

So when a plain char is assigned to an int, the result is ambiguous. If char is signed and its sign bit is set, the value is sign-extended and the int becomes negative; if char is unsigned, the int is always non-negative.
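One way to see which choice an implementation made is to look at CHAR_MIN from <limits.h>. The sketch below uses made-up helper names (plain_char_is_signed, char_to_int, uchar_to_int) and assumes 8-bit chars for the example values.

```c
#include <limits.h>

/* Returns 1 if plain char is signed on this implementation, 0 if unsigned.
   CHAR_MIN is 0 when char is unsigned and negative when it is signed. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Widening a plain char: the result depends on the implementation's choice. */
int char_to_int(char c)
{
    return c;
}

/* Widening an unsigned char: the result is always non-negative. */
int uchar_to_int(unsigned char c)
{
    return c;
}
```

For the byte 0x84, char_to_int yields -124 where plain char is signed and 132 where it is unsigned, while uchar_to_int yields 132 on both.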

I believe the quoted text from the book refers to this; explicitly using unsigned char avoids the problem.