
Char vs unsigned char for byte arrays

When storing "byte arrays" (blobs, ...), is it better to use char or unsigned char for the elements (unsigned char, a.k.a. uint8_t)? (The standard says that sizeof of both is exactly 1 byte.)

Does it matter at all? Or is one more convenient or prevalent than the other? For instance, what do libraries like Boost use?

If char is signed, then performing arithmetic on a byte value with the high bit set will result in sign extension when the value is promoted to int; so, for example:

char c = '\xf0';
int res = (c << 24) | (c << 16) | (c << 8) | c;

will give 0xfffffff0 instead of 0xf0f0f0f0. This can be avoided by masking c with 0xff before shifting.
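As a minimal sketch of the masking fix, assuming 8-bit bytes and a two's complement int: the helper names low_byte and replicate_byte are made up for this illustration and do not come from any library.

```cpp
#include <cstdint>

// Hypothetical helper: keep only the low 8 bits of the promoted value.
// If char is signed, '\xf0' promotes to a negative int; the mask drops
// the sign-extended upper bits again.
constexpr std::uint32_t low_byte(char c) {
    return static_cast<std::uint32_t>(c & 0xff);
}

// Hypothetical helper: copy one byte into all four bytes of a 32-bit word.
// All shifts happen on non-negative values, so no sign bits leak in.
constexpr std::uint32_t replicate_byte(char c) {
    return (low_byte(c) << 24) | (low_byte(c) << 16) |
           (low_byte(c) << 8) | low_byte(c);
}

static_assert(replicate_byte('\xf0') == 0xf0f0f0f0u,
              "masking prevents sign extension");
```

Storing the byte as unsigned char in the first place makes the mask unnecessary, which is one practical argument for that type.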

char may still be preferable if you're interfacing with libraries that use it instead of unsigned char.

Note that a cast between char * and unsigned char * is always safe (3.9p2). A philosophical reason to favour unsigned char is that 3.9p4 in the standard favours it, at least for representing byte arrays that could hold memory representations of objects:

The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T, where N equals sizeof(T).
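To make the quoted wording concrete, here is a rough sketch that copies an object's representation into an array of unsigned char and back. The names to_bytes, from_bytes and demo are invented for this example, and the sketch assumes T is trivially copyable.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: copy the object representation of `value`
// into an array of sizeof(T) unsigned char elements.
template <typename T>
std::array<unsigned char, sizeof(T)> to_bytes(const T& value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only trivially copyable types may be copied bytewise");
    std::array<unsigned char, sizeof(T)> bytes{};
    std::memcpy(bytes.data(), &value, sizeof(T));
    return bytes;
}

// Hypothetical helper: rebuild a T from a previously captured representation.
template <typename T>
T from_bytes(const std::array<unsigned char, sizeof(T)>& bytes) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only trivially copyable types may be copied bytewise");
    T value;
    std::memcpy(&value, bytes.data(), sizeof(T));
    return value;
}

// Round trip: the bytes carry the complete value of the original object.
void demo() {
    const std::uint32_t original = 0xDEADBEEF;
    const auto bytes = to_bytes(original);
    assert(bytes.size() == sizeof original);
    assert(from_bytes<std::uint32_t>(bytes) == original);
}
```

The order of the individual bytes depends on the platform's endianness, so the sketch only checks the round trip, not specific byte values.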

Theoretically, the size of a byte in C++ is dependent on the compiler settings and target platform, but it is guaranteed to be at least 8 bits. Since uint8_t has exactly 8 bits, it cannot be smaller than a byte, which explains why sizeof(uint8_t) is required to be 1 wherever that type exists.

Here is, more precisely, what the standard has to say about it:

§1.7/1

The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (2.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit. The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.

So, if you are working on some special hardware where bytes are not 8 bits, it may make a practical difference. Otherwise, I'd say that it's a matter of taste and of what information you want to communicate via the choice of type.
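If a code base does rely on 8-bit bytes, that assumption can be written down so the build fails on unusual hardware. A small sketch follows; the constant max_byte_value is a name chosen for this example only.

```cpp
#include <climits>
#include <limits>

// Guaranteed by the standard on every conforming implementation.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
static_assert(sizeof(char) == 1 && sizeof(unsigned char) == 1,
              "sizeof is measured in units of char");

// Assumption of this sketch, not a guarantee of the standard:
// the code base only targets platforms with 8-bit bytes.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

// With 8-bit bytes, unsigned char covers exactly the range 0..255.
constexpr int max_byte_value = std::numeric_limits<unsigned char>::max();
static_assert(max_byte_value == 255, "unsigned char holds 0..255");
```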

From a readability point of view, though, unsigned char makes it clearer that the values are in the range 0..255.

One of the other problems with potentially using a signed value for blobs is that the value will depend on the sign representation, which is not fixed by the standard. So, it's easier to invoke undefined behavior.

For example:

signed char x = 0x80;
int y = 0xffff00ff;

y |= (x << 8); // UB

The actual arithmetic value would also depend strictly on two's complement representation, which may surprise some people. Using unsigned types explicitly avoids these problems.
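For comparison, here is one way the same bit manipulation might look with unsigned types only. The function name or_into_second_byte is hypothetical, and the sketch assumes a fixed 32-bit word is what the caller wants.

```cpp
#include <cstdint>

// Hypothetical helper: OR a byte into bits 8..15 of a 32-bit word.
// Both operands are unsigned, so the shift is well defined and the
// result does not depend on any sign representation.
constexpr std::uint32_t or_into_second_byte(std::uint32_t word,
                                            unsigned char byte) {
    return word | (static_cast<std::uint32_t>(byte) << 8);
}

static_assert(or_into_second_byte(0xffff00ffu, 0x80) == 0xffff80ffu,
              "only bits 8..15 are affected");
```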
