How to bitwise operate on memory block (C++)

Is there a better (faster or more efficient) way to perform a bitwise operation on a large memory block than using a for loop? After looking into the options, I noticed that the standard library provides std::bitset, and I was wondering whether it would be better (or even possible) to convert a large region of memory into a bitset without changing its values, perform the operations, and then switch its type back to normal.

Edit / update: I think a union might apply here, such that the memory block is allocated as a new array of int or something similar and then manipulated as one large bitset. Operations seem to be possible over the entire set, based on what is said here: http://www.cplusplus.com/reference/bitset/bitset/operators/ .

In general, there is no magical way that is faster than a for loop. However, you can make it easier for the compiler to optimize the loop by keeping a few things in mind:

  1. Process the buffer in units of the largest available integer type, loading one such value from memory at a time. However, you need to be careful if the length of your buffer is not evenly divisible by the size of that integer type.
  2. If possible, operate on multiple values in one loop iteration; this should make vectorization much simpler for the compiler. Again, you need to be careful about the buffer length.
  3. If the loop is to be run many times on short sections of memory, use a loop index that counts downwards to zero rather than upwards, and subtract it from the array length. This reduces the exit test to a comparison against zero, which can be cheaper and may be easier for the CPU's branch predictor to handle.
  4. You can use explicit vector extensions provided by the compiler, but this will make your code less portable.
  5. Ultimately, you can write the loop in assembly and use vector instructions provided by your CPU, but this is completely unportable.
  6. [edit] Additionally, you can use OpenMP or a similar API to divide the loop between multiple threads, but this will only bring an improvement if you are performing the operation on a very large amount of memory.
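
As an illustration of point 4, here is a minimal sketch that uses the vector extension offered by GCC and Clang (the vector_size attribute). It is not part of the original answer: the function name xor_bytes_vec and the 16-byte vector width are assumptions chosen for the example, and the code will not compile on compilers that lack this extension. Loads and stores go through memcpy, so the buffer does not need any particular alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* GCC/Clang-specific: a vector of 16 unsigned chars (assumed width). */
typedef unsigned char vec16 __attribute__((vector_size(16)));

/* Hypothetical helper: XOR every byte of buf[0..len) with key. */
void xor_bytes_vec(unsigned char *buf, size_t len, unsigned char key)
{
    assert(buf != NULL || len == 0);

    vec16 k;
    memset(&k, key, sizeof k);              /* broadcast key to all 16 lanes */

    size_t i = 0;
    for (; i + sizeof(vec16) <= len; i += sizeof(vec16)) {
        vec16 v;
        memcpy(&v, buf + i, sizeof v);      /* alignment-safe load */
        v ^= k;                             /* element-wise XOR of the whole vector */
        memcpy(buf + i, &v, sizeof v);      /* alignment-safe store */
    }

    for (; i < len; ++i) {                  /* scalar tail: the last len % 16 bytes */
        buf[i] ^= key;
    }
}
```

Whether this beats a plain loop depends on the compiler and its flags; an optimizing compiler will often auto-vectorize the simple loop on its own, so it is worth measuring before trading away portability.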

C99 example of XORing memory with a constant byte, using uint64_t as the widest portable integer type, assuming the start of the buffer is aligned to 8 bytes, and without considering point 3. Bitwise operations on two memory buffers are very similar.

#include <stddef.h>
#include <stdint.h>

size_t len = ...;
char *buffer = ...;   // assumed to be aligned to 8 bytes

size_t const loads_per_i = 4;
size_t words = len / sizeof(uint64_t);
size_t iters = words / loads_per_i;

uint64_t *ptr = (uint64_t *) buffer;
uint64_t xorvalue = 0x5e5e5e5e5e5e5e5eULL;

// run in multiple threads if there are more than 4 MB to xor
// (131072 iterations * 4 loads * 8 bytes = 4 MiB)
#pragma omp parallel for if(iters > 131072)
for (size_t i = 0; i < iters; ++i) {
    size_t j = loads_per_i*i;
    ptr[j  ] ^= xorvalue;
    ptr[j+1] ^= xorvalue;
    ptr[j+2] ^= xorvalue;
    ptr[j+3] ^= xorvalue;
}

// finish the 64-bit words which don't fill a group of 4
for (size_t i = iters * loads_per_i; i < words; ++i) {
    ptr[i] ^= xorvalue;
}

// finish the trailing bytes which don't fill a 64-bit word
for (size_t i = words * sizeof(uint64_t); i < len; ++i) {
    buffer[i] ^= 0x5e;
}
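
The remark that bitwise operations on two memory buffers are very similar can be sketched as follows. This is an illustrative variant, not code from the original answer: the function name xor_buffers is hypothetical, the two buffers are assumed not to overlap partially, and memcpy is used for the word-sized accesses so that neither buffer needs to be aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: dst[i] ^= src[i] for every i in [0, len). */
void xor_buffers(unsigned char *dst, const unsigned char *src, size_t len)
{
    assert((dst != NULL && src != NULL) || len == 0);

    size_t i = 0;

    /* Main part: one 64-bit word per step. */
    for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t)) {
        uint64_t a, b;
        memcpy(&a, dst + i, sizeof a);   /* alignment-safe loads */
        memcpy(&b, src + i, sizeof b);
        a ^= b;
        memcpy(dst + i, &a, sizeof a);   /* alignment-safe store */
    }

    /* Tail: the last len % 8 bytes, one byte at a time. */
    for (; i < len; ++i) {
        dst[i] ^= src[i];
    }
}
```

Because XOR is its own inverse, calling xor_buffers(dst, src, len) a second time with the same src restores the original contents of dst, which makes for a convenient sanity check.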

