
Why are there no “unsigned wchar_t” and “signed wchar_t” types?

The signedness of char is not standardized. Hence there are signed char and unsigned char types. Therefore functions which work with a single character must use an argument type which can hold both signed char and unsigned char (this type was chosen to be int), because if the argument type were char, we would get type conversion warnings from the compiler (if -Wconversion is used) in code like this:

char c = 'ÿ';
if (islower((unsigned char) c)) ...

warning: conversion to ‘char’ from ‘unsigned char’ may change the sign of the result

(Here we consider what would happen if the argument type of islower() were char.)

And the thing which makes it work without explicit typecasting is the automatic promotion from char to int.

Further, the ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation of wchar_t.

Some quotations from the glibc reference:

it would be legitimate to define wchar_t as char

if wchar_t is defined as char the type wint_t must be defined as int due to the parameter promotion.

So, wchar_t can perfectly well be defined as char, which means that similar rules must apply to wide character types, i.e., there may be implementations where wchar_t is unsigned, and there may be implementations where wchar_t is signed. From this it follows that there must exist unsigned wchar_t and signed wchar_t types (for the same reason as there are unsigned char and signed char types).

Private communication reveals that an implementation is allowed to support wide characters with >=0 value only (independently of signedness of wchar_t). Anybody knows what this means? Does this mean that when wchar_t is a 16-bit type (for example), we can only use 15 bits to store the value of a wide character? In other words, is it true that a sign-extended wchar_t is a valid value? See also this question.

Also, private communication reveals that the standard requires that any valid value of wchar_t must be representable by wint_t. Is it true?

Consider this example:

#include <locale.h>
#include <ctype.h>
int main (void)
{
  setlocale(LC_CTYPE, "fr_FR.ISO-8859-1");

  /* 11111111 */
  char c = 'ÿ';

  if (islower(c)) return 0;
  return 1;
}

To make it portable, we need the cast to '(unsigned char)'. This is necessary because char may be equivalent to signed char, in which case a byte where the top bit is set would be sign-extended when converting to int, yielding a value that is outside the range of unsigned char.
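The sign extension described above can be made visible with a minimal sketch. The helper names promote_plain and promote_cast are hypothetical, the byte 0xFF stands in for 'ÿ' in ISO-8859-1, and the sketch assumes CHAR_BIT is 8.

#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: the value a char has after plain promotion to int. */
static int promote_plain(char c)
{
    return c;
}

/* Hypothetical helper: the value after first converting to unsigned char. */
static int promote_cast(char c)
{
    return (unsigned char)c;
}

int main(void)
{
    char c = (char)0xFF;  /* byte with the top bit set; assumes CHAR_BIT == 8 */

    /* Typically -1 where char is signed, 255 where it is unsigned. */
    printf("plain promotion:   %d\n", promote_plain(c));

    /* Always within 0..UCHAR_MAX, the range islower() is specified for. */
    printf("via unsigned char: %d\n", promote_cast(c));
    return 0;
}

Only the second value is guaranteed to be a legal argument to the <ctype.h> functions on every implementation.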

Now, why is this scenario different from the following example for wide characters?

#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
  setlocale(LC_CTYPE, "");
  wchar_t wc = L'ÿ';

  if (iswlower(wc)) return 0;
  return 1;
}

We need to use iswlower((unsigned wchar_t)wc) here, but there is no unsigned wchar_t type.

Why are there no unsigned wchar_t and signed wchar_t types?

UPDATE

Are the standards saying that casting to unsigned int and to int in the following two programs is guaranteed to be correct? (I just replaced wint_t and wchar_t with their actual meaning in glibc.)

#include <locale.h>
#include <wchar.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  unsigned int wc;
  wc = getwchar();
  putwchar((int) wc);
}

-- -

#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  int wc;
  wc = L'ÿ';
  if (iswlower((unsigned int) wc)) return 0;
  return 1;
}

TL;DR:

Why are there no unsigned wchar_t and signed wchar_t types?

Because C's wide-character handling facilities were defined such that they are not needed.


In more detail,

The signedness of char is not standardized.

To be precise, "The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char." (C2011, 6.2.5/15)

Hence there are signed char and unsigned char types.

"Hence" implies causation, which would be hard to argue clearly, but certainly signed char and unsigned char are more appropriate when you want to handle numbers, as opposed to characters.

Therefore functions which work with a single character must use an argument type which can hold both signed char and unsigned char

No, not at all. Standard library functions that work with individual characters could easily be defined in terms of type char, regardless of whether that type is signed, because the library implementation does know its signedness. If that were a problem then it would apply equally to the string functions, too, and char would be useless.

Your example of getchar() is inapposite. It returns int rather than a character type because it needs to be able to return an error indicator that does not correspond to any character. Moreover, the code you present does not correspond to the accompanying warning message: it contains a conversion from int to unsigned char, but no conversion from char to unsigned char.

Some other character-handling functions accept int parameters or return values of type int both for compatibility with getchar() and other stdio functions, and for historic reasons. In days of yore, you couldn't actually pass a char at all: it would always be promoted to int, and that is what the functions would (and must) accept. One cannot later change the argument type, evolution of the language notwithstanding.

Further, the ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation of wchar_t.

C90 isn't really relevant any longer, but no doubt it says something very similar to C2011 (7.19/2), which describes wchar_t as

an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales [...].

Your quotations from the glibc reference are non-authoritative, except possibly for glibc only. They appear in any case to be commentary, not specification, and it's unclear why you raise them. Certainly, though, at least the first is correct. Referring to the standard, if all the members of the largest extended character set specified among the locales supported by a given implementation could fit in a char then that implementation could define wchar_t as char. Such implementations used to be much more common than they are today.

You ask several questions:

Private communication reveals that an implementation is allowed to support wide characters with >=0 value only (independently of signedness of wchar_t). Anybody knows what this means?

I think it means that whoever communicated that to you doesn't know what they are talking about, or perhaps that what they are talking about is something different from the requirements placed by the C standard. You will find that in practice, character sets are defined with only non-negative character codes, but that is not a constraint placed by the C standard.

Does this mean that when wchar_t is a 16-bit type (for example), we can only use 15 bits to store the value of a wide character?

The C standard does not say or imply that. You can store the value of any supported character in a wchar_t. In particular, if an implementation supports a character set containing character codes exceeding 32767, then you can store those in a wchar_t.

In other words, is it true that a sign-extended wchar_t is a valid value?

The C standard does not say or imply that. It does not even say whether wchar_t is a signed type (if not, then sign extension is meaningless for it). If it is a signed type, then there is no guarantee about whether sign-extending a value representing a character in some supported character set (which value could, in principle, be negative) will produce a value that also represents a character in that character set, or in any other supported character set. The same is true of adding 1 to a wchar_t value.

Also, private communication reveals that the standard requires that any valid value of wchar_t must be representable by wint_t. Is it true?

It depends what you mean by "valid". The standard says that wint_t

is an integer type unchanged by default argument promotions that can hold any value corresponding to members of the extended character set, as well as at least one value that does not correspond to any member of the extended character set.

(C2011, 7.29.1/2)

wchar_t must be able to hold any value corresponding to a member of the extended character set, in any supported locale. wint_t must be able to hold all of those values, too. It may be, however, that wchar_t is capable of representing values that do not correspond to any character in any supported character set. Such values are valid in the sense that the type can represent them. wint_t is not required to be able to represent such values.

For example, if the largest extended character set of any supported locale uses character codes up to but not exceeding 32767, then an implementation would be free to implement wchar_t as an unsigned 16-bit integer, and wint_t as a signed 16-bit integer. The values representable by wchar_t that do not correspond to extended characters are then not representable by wint_t (but wint_t still has many candidates for its required value that does not correspond to any character).
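To see what a particular implementation actually chose, a small probe along the following lines may help. This is only a sketch: the helper names wchar_is_signed and survives_wint are hypothetical, and the output differs between platforms, so none is shown.

#include <stdint.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: nonzero if wchar_t is a signed type here. */
static int wchar_is_signed(void)
{
    return (wchar_t)-1 < (wchar_t)0;
}

/* Hypothetical helper: nonzero if wc survives a round trip through wint_t. */
static int survives_wint(wchar_t wc)
{
    wint_t wi = (wint_t)wc;
    return (wchar_t)wi == wc;
}

int main(void)
{
    printf("wchar_t: %zu bytes, %s\n", sizeof(wchar_t),
           wchar_is_signed() ? "signed" : "unsigned");
    printf("wint_t:  %zu bytes\n", sizeof(wint_t));
    printf("WCHAR_MIN..WCHAR_MAX: %lld..%lld\n",
           (long long)WCHAR_MIN, (long long)WCHAR_MAX);

    /* L'a' is a member of every extended character set, so this holds. */
    printf("L'a' survives wint_t: %d\n", survives_wint(L'a'));
    return 0;
}

The round trip is guaranteed only for values that correspond to members of the extended character set; for other wchar_t values the standard makes no such promise.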

With respect to the character and wide-character classification functions, the only answer is that the differences simply arise from different specifications. The char classification functions are defined to work with the same values that getchar() is defined to return: either EOF or a character value converted, if necessary, to unsigned char. The wide-character classification functions, on the other hand, accept arguments of type wint_t, which can represent the values of all wide characters unchanged, therefore there is no need for a conversion.
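The contrast between the two specifications can be sketched as follows. The wrapper names char_is_lower and wide_is_lower are hypothetical, and the program deliberately stays in the default "C" locale so that the results do not depend on installed locales.

#include <ctype.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical wrapper: a plain char must go through unsigned char first. */
static int char_is_lower(char c)
{
    return islower((unsigned char)c) != 0;
}

/* Hypothetical wrapper: a wchar_t converts to wint_t with no cast. */
static int wide_is_lower(wchar_t wc)
{
    return iswlower(wc) != 0;
}

int main(void)
{
    /* In the "C" locale, islower() is true only for the letters a-z. */
    printf("%d %d\n", char_is_lower('a'), char_is_lower('A'));
    printf("%d %d\n", wide_is_lower(L'a'), wide_is_lower(L'A'));

    /* A byte with the top bit set is safe to pass thanks to the cast. */
    printf("%d\n", char_is_lower((char)0xFF));
    return 0;
}

The cast lives only in the char wrapper; the wide-character wrapper relies on the ordinary conversion from wchar_t to the wint_t parameter.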

You claim in this regard that

We need to use iswlower((unsigned wchar_t)wc) here, but there is no unsigned wchar_t type.

No and maybe. You do not need to convert the wchar_t argument to iswlower() to any other type, and in particular, you do not need to convert it to an explicitly unsigned type. The wide-character classification functions are not analogous to the regular character classification functions in this respect, having been designed with the benefit of hindsight. As for unsigned wchar_t, C does not require such a type to exist, so portable code should not use it, but it may exist in some implementations.


Regarding the update appended to the question:

Are the standards saying that casting to unsigned int and to int in the following two programs is guaranteed to be correct? (I just replaced wint_t and wchar_t with their actual meaning in glibc.)

The standard says nothing of the sort about conforming implementations in general. I'll suppose, however, that you mean to ask specifically about conforming implementations for which wchar_t is int and wint_t is unsigned int.

On such an implementation, your first program is flawed because it does not account for the possibility that getwchar() returns WEOF. Converting WEOF to type wchar_t, if doing so does not cause a signal to be raised, is not guaranteed to produce a value that corresponds to any wide character. Passing the result of such a conversion to putwchar() therefore does not exhibit defined behavior. Moreover, if WEOF is defined with the same value as UINT_MAX (which is not representable by int) then the conversion of that value to int has implementation-defined behavior independently of the putwchar() call.

On the other hand, I think the key point you are struggling with is that if the value returned by getwchar() in the first program is not WEOF, then it is guaranteed to be one that is unchanged by conversion to wchar_t. Your first program will perform as appears to be intended in that case, but the cast to int (or wchar_t) is unnecessary.

Similarly, the second program is correct provided that the wide-character literal corresponds to a character in the applicable extended character set, but the cast is unnecessary and changes nothing. The wchar_t value of such a literal is guaranteed to be representable by type wint_t, so the cast changes the type of its operand, but not the value. (But if the literal does not correspond to a character in the extended character set then the behavior is implementation-defined.)

On the third hand, if your objective is to write strictly conforming code then the right thing to do, and indeed the intended usage mode of these particular wide-character functions, would be this:

#include <locale.h>
#include <wchar.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  wint_t wc = getwchar();
  if (wc != WEOF) {
    // No cast is necessary or desirable
    putwchar(wc);
  }
}

and this:

#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  wchar_t wc = L'ÿ';
  // No cast is necessary or desirable
  if (iswlower(wc)) return 0;
  return 1;
}
