
Bit vector operations and Endianness

I do a lot of bit vector operations in my software. For example, suppose I need to store boolean information about a candidate 'n'. I do the following:

uint64_t *information_vector;
uint32_t pos = n / 64;
uint32_t bit_pos = n % 64;

information_vector[pos] |= (1 << bit_pos);

I follow a similar procedure when reading that information:

uint32_t pos = n / 64;
uint32_t bit_pos = n % 64;
if (information_vector[pos] & (1 << bit_pos)) {
       // do something
}

In the meantime, I also write the information_vector to disk and read it back again. Now I am trying to solve a bug that is giving me nightmares, and it struck me that endianness might be the culprit, but I cannot explain how. Is there any way I can check? Is this bit vector manipulation generally endian-safe across architectures?

I also see that elsewhere in the code I set some other information in another bit vector for the same candidate:

uint8_t byte_position = n / 8;
uint8_t bit_position = n % 8;
another_information_vector[byte_position] |= (1 << bit_position);

I usually find the common set of attributes by and-ing these bit vectors.

Generally speaking, if you always access your bit vector using the same type (in your case uint64_t), and the endianness of all systems on which you access the data is the same, then endianness will not be a problem.

The easiest way to reassure yourself, though, is to cast the address of the object to char* and dereference it, which lets you see the bytes one at a time in the order they are laid out in memory.
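The byte-inspection idea above can be turned into a small runtime check. This is only a sketch: host_is_little_endian is a hypothetical helper name, not something from the question, and it only distinguishes little-endian hosts from everything else.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a little-endian host, 0 otherwise. */
static int host_is_little_endian(void)
{
    const uint32_t probe = 1;      /* only the least significant byte is nonzero */
    unsigned char first_byte;

    /* Copy the byte stored at the lowest address of probe. */
    memcpy(&first_byte, &probe, 1);

    /* If the least significant byte comes first, the host is little-endian. */
    return first_byte == 1;
}
```

Logging the result once at startup on each machine that reads or writes the vector is a cheap way to confirm whether the machines actually differ.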

Update: I just noticed that your third block of code addresses its vector in 8-bit units, computing byte_position as n / 8 and bit_position as n % 8.

If you sometimes write out an array of uint64_t and sometimes treat it as an array of uint8_t, then your results will probably be unexpected on a big-endian system: with masks of the form 1 << bit_position, the two views address the same bits only when the system is little-endian.
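Whether mixing the two access widths is safe on a given machine can be checked directly. The sketch below sets bit n through a uint64_t view and reads it back through a uint8_t view of the same memory; views_agree is a hypothetical name, and the two-word buffer (so n must be below 128) is an assumption made to keep the example small.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the 64-bit and 8-bit views address
 * the same bit for index n on this host, 0 otherwise. Requires n < 128. */
static int views_agree(uint32_t n)
{
    uint64_t words[2] = {0, 0};
    uint8_t bytes[sizeof words];

    /* Write bit n using 64-bit cells, as in the first snippet of the question. */
    words[n / 64] |= UINT64_C(1) << (n % 64);

    /* Reinterpret the same memory as bytes without aliasing problems. */
    memcpy(bytes, words, sizeof words);

    /* Read bit n using 8-bit cells, as in the third snippet of the question. */
    return (bytes[n / 8] >> (n % 8)) & 1;
}
```

Running it on each target platform shows whether a vector written with one cell size can be read with the other there.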

The best way to avoid this problem is to keep your types consistent.

To make this problem more concrete, consider the following example:

#include <stdio.h>
#include <stdint.h>

int main(){
    uint64_t myVector = (uint64_t)1 << 2; // set bit 2 (the third bit) of the least significant byte
    uint8_t * ptr = (uint8_t *) &myVector;
    int i;
    for (i = 0; i < 8; i++)
       printf("%x\n", ptr[i]);
}

On my little-endian x86 system, this prints 4 followed by seven 0s, because the least significant byte is stored at the lowest address in the uint64_t and the most significant byte at the highest. This might run counter to your intuition if you are used to thinking of the bits as laid out from most significant to least significant, left to right.
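One way to follow the advice about consistent types is to route every access through a single pair of helpers that only ever use uint64_t. This is a minimal sketch; bitvec_set and bitvec_test are hypothetical names. It also builds the mask from UINT64_C(1) rather than a plain 1, which goes beyond what the answer says: the literal 1 has type int, and shifting an int by its width or more (typically 32 bits) is undefined behavior in C, so bit positions 32 to 63 are not reliable with 1 << bit_pos.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: set bit n in a vector of 64-bit words. */
static void bitvec_set(uint64_t *words, size_t n)
{
    /* UINT64_C(1) makes the shift happen in 64 bits, not in int. */
    words[n / 64] |= UINT64_C(1) << (n % 64);
}

/* Hypothetical helper: return 1 if bit n is set, 0 otherwise. */
static int bitvec_test(const uint64_t *words, size_t n)
{
    return (int)((words[n / 64] >> (n % 64)) & 1);
}
```

Because every read and write goes through the same cell type, the in-memory bit numbering stays consistent regardless of the host's byte order.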

This is certainly endian-safe across architectures as long as the data stays within the CPU. Writing it to disk on one architecture and reading it back on a different one depends on how you read and write it. This is no different from the problem you would have writing any multi-byte number to disk and reading it back: both ends have to interpret the number the same way. If, in this example, you just write the 8 bytes to disk and then read them on an architecture of a different endianness, the bytes will come back swapped.
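A common way to make both ends interpret the number the same way is to fix the byte order of the file format and convert explicitly with shifts. The sketch below assumes a little-endian on-disk layout; store_le64 and load_le64 are hypothetical helper names, and the choice of little-endian is arbitrary as long as reader and writer agree.

```c
#include <stdint.h>

/* Hypothetical helper: encode value into 8 bytes, least significant first. */
static void store_le64(unsigned char out[8], uint64_t value)
{
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(value >> (8 * i));
}

/* Hypothetical helper: decode 8 bytes written by store_le64. */
static uint64_t load_le64(const unsigned char in[8])
{
    uint64_t value = 0;
    for (int i = 0; i < 8; i++)
        value |= (uint64_t)in[i] << (8 * i);
    return value;
}
```

Each word of the vector would be passed through store_le64 before fwrite and through load_le64 after fread. The shifts operate on values rather than on memory layout, so the same code produces the same file on any host.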

For most cases, the safest variant is to operate at byte level, so the divisor is 8. On the other hand, this can be suboptimal in some cases: there are architectures without direct access to a byte, or where byte access is expensive compared with word access.

On a little-endian machine, the same approach works unchanged with any reasonable divisor (8, 16, 32, 64). For example, for bit index 22, byte-level access deals with bit 6 of the byte with index 2; short-word access deals with bit 6 of the short word with index 1; and so forth.

On a big-endian machine, this requires replacing 1 << bit_position with 1 << (BITS_PER_CELL-1-bit_position), or equivalently HIGHEST_BIT >> bit_position, where HIGHEST_BIT is 0x80 for uint8_t, 0x80000000 for uint32_t, etc. Bit index 0 then means the MSB of byte 0, as opposed to the little-endian case, where it means the LSB of byte 0.
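The MSB-first numbering described above can be sketched for byte-sized cells as follows. HIGHEST_BIT is taken from the text; the function names set_bit_msb_first and test_bit_msb_first are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define HIGHEST_BIT 0x80u   /* top bit of a uint8_t cell */

/* Hypothetical helper: set bit n, where bit index 0 is the MSB of byte 0. */
static void set_bit_msb_first(uint8_t *cells, size_t n)
{
    cells[n / 8] |= (uint8_t)(HIGHEST_BIT >> (n % 8));
}

/* Hypothetical helper: return 1 if bit n is set under the same numbering. */
static int test_bit_msb_first(const uint8_t *cells, size_t n)
{
    return (cells[n / 8] & (HIGHEST_BIT >> (n % 8))) != 0;
}
```

With this numbering, bit index 22 lands in byte 2 under the mask 0x80 >> 6, that is 0x02. For wider cells, HIGHEST_BIT and the divisor would change together, as the text describes.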

(A similar effect can be seen on serial wires. In RS232 or Ethernet, each byte is transmitted from LSB to MSB. The individual/group bit in a MAC address is the very first one on the wire, but it is the LSB of the first octet.)
