What are the benefits of using a bitset for bitmap storage?

Question

I'm currently evaluating whether I should use a single large bitset or many 64-bit unsigned integers (uint64_t) to store a large amount of bitmap information. In this case, the bitmap represents the current status of a few GB of memory pages (dirty / not dirty), and has thousands of entries.

The work I am doing requires that I be able to query and update the dirty pages, including performing OR operations between two dirty page bitmaps.

To be clear, I will be performing the following:

Importing a bitmap from a file, and performing a bitwise OR operation with the existing bitmap
Computing the Hamming weight (counting the number of bits set to 1, which represents the number of dirty pages)
Resetting / clearing a bit, to mark it as updated / clean
Checking the current status of a bit, to determine if it is clean

It looks like it is easy to perform bitwise operations on a C++ bitset, and to compute the Hamming weight. However, I imagine there is no magic here -- the CPU can only perform bitwise operations on as many bytes as it can store in a register -- so the routine used by the bitset is likely the same one I would implement myself. This is probably also true for the Hamming weight.
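For reference, the four operations listed above map onto std::bitset roughly as follows. This is only a minimal sketch: the page count kPages, the alias DirtyMap, and the helper names are hypothetical and chosen for illustration, and a real bitmap would be sized to the actual number of pages (std::bitset needs that size at compile time).

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical page count, chosen only for illustration.
constexpr std::size_t kPages = 4096;
using DirtyMap = std::bitset<kPages>;

// Merge an imported bitmap into the existing one (bitwise OR).
inline void merge(DirtyMap& current, const DirtyMap& imported) {
    current |= imported;
}

// Hamming weight: the number of pages currently marked dirty.
inline std::size_t dirty_count(const DirtyMap& map) {
    return map.count();
}

// Clear a bit to mark the page as clean.
inline void mark_clean(DirtyMap& map, std::size_t page) {
    assert(page < kPages);
    map.reset(page);
}

// A page is clean when its bit is not set.
inline bool is_clean(const DirtyMap& map, std::size_t page) {
    assert(page < kPages);
    return !map.test(page);
}
```

How count() and operator|= are implemented internally is up to the standard library, so whether they beat a hand-written loop is something to measure rather than assume.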

In addition, importing the bitmap data from the file into the bitset looks ugly -- I need to perform bitshifts multiple times, as shown here. I imagine that, given the size of the bitsets I would be working with, this would have a negative performance impact. Of course, I imagine I could just use many small bitsets instead, but there may be no advantage to this (other than perhaps ease of implementation).

Any advice is appreciated, as always. Thanks!

Answer 1

Sounds like you have a very specific single-use application. Personally, I've never used a bitset, but from what I can tell its advantages are in being accessible as if it were an array of bools and, for the dynamically sized variants (std::bitset itself has a size fixed at compile time), being able to grow like a vector.

From what I can gather, you don't really have a need for either of those. If that's the case, and if populating the bitset is a hassle, I would lean towards doing it myself, given that it really is quite simple to allocate a whole bunch of integers and do bit operations on them.

Given that you have very specific requirements, you will probably benefit from making your own optimizations. Having access to the raw bit data is kind of crucial for this (for example, using pre-calculated tables of Hamming weights for a single byte, or even two bytes if you have memory to spare).
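A do-it-yourself version along these lines might look like the sketch below. It is an illustration under assumptions, not a tuned implementation: the names RawBitmap, make_popcount_table, and hamming_weight are hypothetical, the table covers a single byte (256 entries), and bounds checking is left out for brevity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical hand-rolled bitmap: one bit per page, packed into 64-bit words.
struct RawBitmap {
    std::vector<std::uint64_t> words;

    explicit RawBitmap(std::size_t bits) : words((bits + 63) / 64, 0) {}

    void set(std::size_t i)        { words[i / 64] |=  (std::uint64_t{1} << (i % 64)); }
    void clear(std::size_t i)      { words[i / 64] &= ~(std::uint64_t{1} << (i % 64)); }
    bool test(std::size_t i) const { return ((words[i / 64] >> (i % 64)) & 1u) != 0; }

    // OR another bitmap of the same size into this one, one word at a time.
    void merge(const RawBitmap& other) {
        assert(words.size() == other.words.size());
        for (std::size_t w = 0; w < words.size(); ++w) words[w] |= other.words[w];
    }
};

// Build the 256-entry table of per-byte Hamming weights.
inline std::array<std::uint8_t, 256> make_popcount_table() {
    std::array<std::uint8_t, 256> table{};  // table[0] stays 0
    for (unsigned v = 1; v < 256; ++v)
        table[v] = static_cast<std::uint8_t>((v & 1u) + table[v / 2]);
    return table;
}

// Hamming weight via table lookups, consuming each word a byte at a time.
inline std::size_t hamming_weight(const RawBitmap& bm) {
    static const std::array<std::uint8_t, 256> table = make_popcount_table();
    std::size_t total = 0;
    for (std::uint64_t w : bm.words)
        for (; w != 0; w >>= 8) total += table[w & 0xFF];
    return total;
}
```

Because the words are directly accessible, a file whose layout already matches the in-memory word layout (an assumption about the file format) could be read straight into `words` without per-bit shifting.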

I don't generally advocate reinventing the wheel... But if you have special optimization requirements, it might be best to tailor your solution towards those. In this case, the functionality you are implementing is pretty simple.

Answer 2

I think if I were you I would probably just save myself the hassle of any DIY and use boost::dynamic_bitset. They've got all the bases covered in terms of functionality, including stream operator overloads which you could use for file I/O (or just read your data in as unsigned ints and use their conversions; see their examples) and a count method for your Hamming weight. Boost is very highly regarded, at least by Sutter & Alexandrescu, and they do everything in the header file: no linking, just #include the appropriate files. In addition, unlike the Standard Library bitset, you can wait until runtime to specify the size of the bitset.

Edit: Boost does seem to allow for the fast input reading that you need. dynamic_bitset supplies the following constructor:

template <typename BlockInputIterator>
dynamic_bitset(BlockInputIterator first, BlockInputIterator last,
               const Allocator& alloc = Allocator());

The underlying storage is a std::vector (or something almost identical to it) of Blocks, e.g. uint64_t values. So if you read in your bitmap as a std::vector of uint64_t values, this constructor will write them directly into memory without any bitshifting.

Answer 3

Thousands of bits does not sound like a lot. But maybe you have millions.

I suggest you write your code as if you had the ideal implementation, by abstracting it (to begin with, use whatever implementation is easiest to code, ignoring any performance and memory requirement problems), then try several alternative specific implementations and verify, by measuring them, which performs best.
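One way to set that up is sketched below, assuming a small shared interface (mark_dirty, mark_clean, is_dirty, dirty_count, merge) that every candidate backend implements. All class and function names here are hypothetical; the two backends are only examples of what could be swapped in and timed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Backend A: the easy-to-write starting point, built on std::vector<bool>.
class VectorBoolMap {
public:
    explicit VectorBoolMap(std::size_t pages) : bits_(pages, false) {}
    void mark_dirty(std::size_t p) { bits_[p] = true; }
    void mark_clean(std::size_t p) { bits_[p] = false; }
    bool is_dirty(std::size_t p) const { return bits_[p]; }
    std::size_t dirty_count() const {
        return static_cast<std::size_t>(std::count(bits_.begin(), bits_.end(), true));
    }
    void merge(const VectorBoolMap& o) {
        assert(bits_.size() == o.bits_.size());
        for (std::size_t i = 0; i < bits_.size(); ++i)
            if (o.bits_[i]) bits_[i] = true;
    }
private:
    std::vector<bool> bits_;
};

// Backend B: packed 64-bit words with word-at-a-time OR.
class WordMap {
public:
    explicit WordMap(std::size_t pages) : words_((pages + 63) / 64, 0) {}
    void mark_dirty(std::size_t p) { words_[p / 64] |= bit(p); }
    void mark_clean(std::size_t p) { words_[p / 64] &= ~bit(p); }
    bool is_dirty(std::size_t p) const { return (words_[p / 64] & bit(p)) != 0; }
    std::size_t dirty_count() const {
        std::size_t n = 0;
        for (std::uint64_t w : words_)
            for (; w != 0; w &= w - 1) ++n;  // each pass clears the lowest set bit
        return n;
    }
    void merge(const WordMap& o) {
        assert(words_.size() == o.words_.size());
        for (std::size_t i = 0; i < words_.size(); ++i) words_[i] |= o.words_[i];
    }
private:
    static std::uint64_t bit(std::size_t p) { return std::uint64_t{1} << (p % 64); }
    std::vector<std::uint64_t> words_;
};

// Caller code is written once against the shared interface, so either
// backend (or a later one) can be dropped in and measured.
template <typename Map>
std::size_t merge_and_count(Map& current, const Map& imported) {
    current.merge(imported);
    return current.dirty_count();
}
```

With the calling code templated like this, comparing backends comes down to timing the same workload with each Map type on realistic bitmap sizes.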

One solution that you did not even consider is to use Judy arrays (specifically Judy1 arrays).


Answer 1
1 2012-09-18 02:11:54

Answer 2
1 2012-09-18 02:16:56

Answer 3
1 2012-09-18 03:04:16
