简体   繁体   English

std :: bitset如何比std :: vector更快 <bool> ?

[英]How can std::bitset be faster than std::vector<bool>?

According to this answer the poster expects a std::bitset of size 100k bits to be faster than a std::vector<bool> when querying individual bits. 根据这个答案 ,发布者期望在查询单个位时,大小为100k位的std::bitsetstd::vector<bool>更快。 How can this be possible? 这怎么可能?

How could they even differ significantly in their implementation, if std::bitset apparently allows for arbitrary sizes just like std::vector ? 如果std::bitset显然允许与std::vector一样的任意大小,它们在实现上甚至会有很大不同吗?

Measurements on Visual Studio 2010 show that std::bitset is not generally faster than std::vector<bool> . 上的Visual Studio 2010显示,测量std::bitset 不是一般快于std::vector<bool> What the exact reason for this is I cannot say -- only that bitset is implemented significantly different from the std::vector full specialization. 我不能说出确切的原因是什么-只有实现的位集与std :: vector完全专业化有很大不同。

std::bitset stores it's full content in the object via a std :: bitset通过以下方式将其全部内容存储在对象中

template<size_t _Bits>
    class bitset .....

    _Ty _Array[_Words + 1]; // the set of bits
    };

array and that makes large bitset unsuitable to be put on the stack -- which isn't a performance argument per se. 数组,这使得大的位集不适合放在堆栈上,这本身不是性能参数。

vector<bool> doesn't suffer from the stack problem, and testing with a size of 1e6 and 1e7 it seems that on my box here querying values in a loop is actually 2x faster with a vector. vector<bool>不会受到堆栈问题的困扰,在大小为1e6和1e7的情况下进行测试似乎在我的方框中使用向量在循环中查询值实际上快了2倍。

Well. 好。 I guess the usual timing caveats apply and YMMV, but here's the test code I used should anyone care to try himself: 我想通常的时间警告和YMMV都适用,但是如果有人愿意尝试一下,这是我使用的测试代码:

The output on my box is: 我的盒子上的输出是:

1
vector<bool> loop with a size of 10000000 and 10 iterations*n: 11187 ms
bitset<10000000> loop with 10 iterations*n: 22719 ms
101250010
Press any key to continue . . .

BitMap.cpp 位图文件

#include "stdafx.h"
#include "BitMap.h"

using namespace std;

// Global var to prevent optimizer from messing things up
volatile size_t ext;

volatile clock_t t1;
volatile clock_t t2;
double delta1;
double delta2;

int main(int argc, _TCHAR* argv[])
{
  ext = 1;
  printf("%d\n", ext);

  vb_t *const vec = new vb_t(bssz);
  bs_t *const bits = new bs_t(); // must put large bitset on heap

  const int iter = 10;
  delta1=0;
  delta2=0;
  for(int o=0; o<5; ++o) {
    t1 = clock();
    for(int i=0; i!=5; ++i)
      bs_loop(iter, *vec);
    t2 = clock();
    delta1 += t2-t1;
    t1 = clock();
    for(int i=0; i!=5; ++i)
      bs_loop(iter, *bits);
    t2 = clock();
    delta2 += t2-t1;
  }

  delta1 /= CLOCKS_PER_SEC;
  delta2 /= CLOCKS_PER_SEC;
  delta1 *= 1000;
  delta2 *= 1000;

  cout << "vector<bool> loop with a size of " << bssz << " and " << iter << " iterations*n: " << delta1 << " ms\n";
  cout << "bitset<" << bssz << "> loop with " << iter << " iterations*n: " << delta2 << " ms\n";

  printf("%d\n", ext);
  delete vec;
  delete bits;
  return 0;
}

BitMap.h 位图

#pragma once
#include <vector>
#include <bitset>

extern volatile size_t ext;
const size_t bssz = size_t(1e7); // 1e7 ca 10m

using namespace std; // Test code, using here is OK.
typedef vector<bool> vb_t;
typedef bitset<bssz> bs_t;

template<class COLL>
void bs_loop(const int iterations, COLL const& v);

bs_loop.cpp bs_loop.cpp

#include "stdafx.h"
#include "BitMap.h"

template<class COLL>
void bs_loop(const int iterations, COLL const& v)
{
  ext = sizeof(COLL);
  for(size_t i=0; i!=iterations; ++i) {
    ++ext;
    for(size_t j=0, e=v.size(); j!=e; ++j) {
      if(v[j]) {
        --ext;
      }
      else {
        ++ext;
      }
    }
  }
}

template
void bs_loop(const int iterations, vb_t const& v);

template
void bs_loop(const int iterations, bs_t const& v);

Compiler command line: 编译器命令行:

/Zi /nologo /W3 /WX- /O2 /Oi /Oy- /D "WIN32" /D "NDEBUG"
/D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS /Gy 
/fp:precise /Zc:wchar_t /Zc:forScope /Yu"StdAfx.h" /Fp"Release\BitMap.pch" 
/Fa"Release\" /Fo"Release\" /Fd"Release\vc100.pdb" /Gd /analyze- 
/errorReport:queue 

note the /O2 and the missing /GL (no whole prg opt). 请注意/ O2和缺少的 / GL(没有整个程序选择)。

Well, since I'm the guy you're basing this question on, here's where I got that idea from : 好吧,因为我是您要问这个问题的人,所以我从这里得到了这个主意

"…it packs the bools and stores them as individual bits (inside, say, chars) in its internal representation. One consequence of this is that it can't just return a normal bool& from its operator[] or its dereferenced iterators[2]; instead, it has to play games with a helper "proxy" class that is bool-like but is definitely not a bool. Unfortunately, that also means that access into a vector<bool> is slower, because we have to deal with proxies instead of direct pointers and references. “……打包布尔并将其作为单个位(在内部,例如chars)存储在其内部表示中。其结果之一是,它不能仅从其operator []或其解引用的迭代器[2]返回普通的bool& ];相反,它必须使用类似于bool的辅助“ proxy”类玩游戏,但绝对不是bool。不幸的是,这也意味着进入vector<bool>的速度较慢,因为我们必须处理代理而不是直接的指针和引用。

Bottom line: If you care more about speed than you do about size, you shouldn't use std::vector<bool> . 底线:如果您更关心速度而不是大小,则不应使用std::vector<bool> Instead, you should hack around this optimization by using a std::vector<char> or the like instead, which is unfortunate but still the best you can do." 相反,您应该改用std::vector<char>或类似的方法来解决此优化问题,这很不幸,但仍然可以做到最好。

Or, as I recommended, if you know the biggest size that your set will get, use std::bitset . 或者,按照我的建议,如果您知道集合将获得的最大大小,请使用std::bitset

Honestly I think bitset it's better to use in the stack and not on the heap. 老实说,我认为最好在堆栈中而不是在堆上使用位集。 moreover the two are not in conflict with one to another because an elegant solution can be something like this: 此外,这两个彼此之间不会冲突,因为一种优雅的解决方案可以是这样的:

vector< bitset<64> > v(100000) //or whatever...

it can be interesting a test in comparison of this two instead: 对比一下这两个,可能会很有趣:

vector<unsigned char> v1(1000000) //8 bits to manage them manually
vector< bitset<8> >   v2(1000000) //8 bits managed by bitset

Moreover, just for adding to the answers here and remind how the compiler depend A LOT in the performance too, here's a simple test done with: 此外,只是为了增加此处的答案,并提醒编译器在性能上也如何依赖LOT,下面是一个简单的测试:

  • VS2012 VS2012
  • mingw/g++ 4.7.0 mingw / g ++ 4.7.0
  • g++ 4.8.2 on Ubuntu Ubuntu上的g ++ 4.8.2

(but all these tests are a little bit tricky and maybe give us only the rough general idea for a DIRECT comparison. Profiling the project it's the only thing in the end to do.) (但是所有这些测试都有些棘手,也许只能为我们提供直接比较的大致思路。对项目进行概要分析是最终唯一要做的事情。)

  • VS2012 compiled in Release (default Release provided). VS2012在Release中编译(提供了默认Release)。
  • g++ compiled with -O2 用-O2编译的g ++
  • g++ cimpiled with -O2 -std=c++11 用-O2 -std = c ++ 11封闭的g ++

NOTE: 注意:

with size 10^7: 大小为10 ^ 7:

  • VS2012 crash at runtime. VS2012在运行时崩溃。 (So I can assume have a memory management different from g++) (因此,我可以假定内存管理不同于g ++)
  • g++ it's ok. g ++没关系。
  • g++11 have a problem with test2() reporting time =0, I printed out some values just to trigger the code execution. g ++ 11的test2()报告时间= 0时有问题,我打印出一些值只是为了触发代码执行。 (I suppose it's a compiler optimization). (我想这是编译器优化)。

I included the overhead time of constructor and destructor of the objects too. 我也包括了对象的构造函数和析构函数的开销时间。

here the simple test code: 这里是简单的测试代码:

#include <iostream>
#include <vector>
#include <bitset>
#include <time.h>

using namespace std;

#define SIZE1 1000000000 //10e9
//#define SIZE2 10000000   //10e7 VS2012 crash at runtime, g++ OK
#define SIZE2 1000000 //10e6

void test1()
{
    register bool j;
    clock_t t1,t2;
    cout.precision(10);

    t1=clock();
    vector<bool> *v = new vector<bool>(SIZE1);
    for(register long int i=0; i<SIZE1;i++)
        (*v)[i] = i%2 == 0? true :false;

    for(register long int i=0; i<SIZE1;i++)
        j=(*v)[i];

    delete v;
    t2=clock();
    cout << "vector speed = " << (t2-t1) / (float) CLOCKS_PER_SEC << " (" << t2 << "," << t1 << ")" << endl;

    t1=clock();
    bitset<SIZE1> *b = new bitset<SIZE1>();
    for(register long int i=0; i<SIZE1;i++)
        (*b)[i] = i%2 == 0? true :false;
    for(register long int i=0; i<SIZE1;i++)
        j=(*b)[i];

    delete b;
    t2=clock();
    cout << "bitset speed = " << (t2-t1) / (float) CLOCKS_PER_SEC << " (" << t2 << "," << t1 << ")" << endl;
}

void test2()
{
    register bool j;
    clock_t t1,t2;
    cout.precision(10);

    t1=clock();
    vector<bool> v(SIZE2);
    for(register int k=0; k<SIZE1/SIZE2; k++)
        for(register long int i=0; i<SIZE2;i++)
            (v)[i] = i%2 == 0? true :false;

    for(register int k=0; k<SIZE1/SIZE2; k++)
        for(register long int i=0; i<SIZE2;i++)
            j=(v)[i];

    t2=clock();
    cout << "vector speed = " << (t2-t1) / (float) CLOCKS_PER_SEC << " (" << t2 << "," << t1 << ")" << endl;
    cout << "v[1], v[2] " <<  (int) v[1] << ", "<< (int)v[2] << endl;

    t1=clock();
    bitset<SIZE2> b;
    for(register int k=0; k<SIZE1/SIZE2; k++)
        for(register long int i=0; i<SIZE2;i++)
            (b)[i] = i%2 == 0? true :false;

    for(register int k=0; k<SIZE1/SIZE2; k++)
        for(register long int i=0; i<SIZE2;i++)
            j=(b)[i];

    t2=clock();
    cout << "bitset speed = " << (t2-t1) / (float) CLOCKS_PER_SEC << " (" << t2 << "," << t1 << ")" << endl;
    cout << "b[1], b[2] " <<  (int) b[1] << ", "<< (int)b[2] << endl;
}


int main(int argc, char* argv[])
{
    test1();
    test2();


    return 0;
}

VS2012 output: VS2012输出:

vector speed = 3.105000019 (3105,0)
bitset speed = 10.44400024 (13551,3107)
vector speed = 3.987999916 (17542,13554)
v[1], v[2] 0, 1
bitset speed = 9.772999763 (27318,17545)
b[1], b[2] 0, 1

mingw/g++ output -O2: mingw / g ++输出-O2:

vector speed = 1.519 (1520,1)
bitset speed = 1.647 (3168,1521)
vector speed = 1.383999944 (4554,3170)
v[1], v[2] 0, 1
bitset speed = 1.610000014 (6166,4556)
b[1], b[2] 0, 1

mingw/g++ output -O2 -std=c++11: mingw / g ++输出-O2 -std = c ++ 11:

vector speed = 1.528 (1529,1)
bitset speed = 1.685 (3215,1530)
vector speed = 1.409999967 (4626,3216)
v[1], v[2] 0, 1
bitset speed = 1.763000011 (6392,4629)
b[1], b[2] 0, 1

g++ 4.8.2 output -O2: g ++ 4.8.2输出-O2:

vector speed = 1.561391 (1564139,2748)
bitset speed = 1.681818 (3246051,1564233)
vector speed = 1.487877011 (4733975,3246098)
v[1], v[2] 0, 1
bitset speed = 1.685297012 (6419328,4734031)
b[1], b[2] 0, 1

g++ 4.8.2 output -O2 -std=c++11: g ++ 4.8.2输出-O2 -std = c ++ 11:

vector speed = 1.561391 (1564139,2748)
bitset speed = 1.681818 (3246051,1564233)
vector speed = 1.487877011 (4733975,3246098)
v[1], v[2] 0, 1
bitset speed = 1.685297012 (6419328,4734031)
b[1], b[2] 0, 1

CONCLUSION: 结论:

As a rough idea vector seems faster for these use cases. 作为一个粗略的想法,对于这些用例来说,向量似乎更快。

I don't run multiple istances and average the results, but more or less the values are always the same. 我不会运行多个等式并对结果取平均,但是或多或少的值始终是相同的。

note on VS : I think it use a different memory management mechanism respect to gcc and for these use cases seems slower in the generated code. 关于VS的注意事项 :我认为它对gcc使用不同的内存管理机制,对于这些用例,生成的代码似乎较慢。

Here's my unscientific benchmark of accessing/inserting 3 billion elements from/into bitset<> and vector<bool> of sizes 100K, 1M and 5M. 这是我不科学的基准,它从大小为100K,1M和5M的bitset<>vector<bool>中访问/插入30亿个元素。 The compiler is GCC 4.8.2 on 64 bit Linux machine (Core i7): 编译器是64位Linux机器(Core i7)上的GCC 4.8.2:

With optimization (compiler flags: -O2 -std=c++11 ): 经过优化(编译器标志: -O2 -std=c++11 ):

[estan@pyret bitset_vs_vector]$ ./bitset_vs_vector 
bitset<100000> (3 billion accesses/inserts): 132.424 ms 
vector<bool>(100000) (3 billion accesses/inserts): 270.577 ms

bitset<1000000> (3 billion accesses/inserts): 67.752 ms 
vector<bool>(1000000) (3 billion accesses/inserts): 268.193 ms

bitset<5000000> (3 billion accesses/inserts): 67.426 ms 
vector<bool>(5000000) (3 billion accesses/inserts): 267.566 ms

Without optimization (compiler flags: -std=c++11 ): 没有优化(编译器标志: -std=c++11 ):

[estan@pyret bitset_vs_vector]$ make
g++ -std=c++11 -o bitset_vs_vector *.cpp
[estan@pyret bitset_vs_vector]$ ./bitset_vs_vector 
bitset<100000> (3 billion accesses/inserts): 1900.13 ms 
vector<bool>(100000) (3 billion accesses/inserts): 1784.76 ms

bitset<1000000> (3 billion accesses/inserts): 1825.09 ms 
vector<bool>(1000000) (3 billion accesses/inserts): 1768.03 ms

bitset<5000000> (3 billion accesses/inserts): 1846.73 ms 
vector<bool>(5000000) (3 billion accesses/inserts): 1763.48 ms

So it seems under these conditions, bitset is faster than vector when the code is optimized, while vector actually comes out on top by a (very) small margin when it's not. 因此,在这种情况下,似乎在优化代码时位集比矢量更快,而在没有优化时,矢量实际上排在最前面(很小)。

That said, if your code is time critical you should probably perform benchmarks yourself, since I suspect these numbers are highly compiler/environment specific. 就是说,如果您的代码对时间很紧迫,那么您可能应该自己执行基准测试,因为我怀疑这些数字与编译器/环境高度相关。

Benchmark code: 基准代码:

#include <iostream>
#include <functional>
#include <bitset>
#include <vector>
#include <ctime>

// Performs N access/insert on container.
template<class T>
void access_and_insert(T &container, int N)
{
    const std::size_t size = container.size();
    for (int i = 0; i < N; ++i) {
        bool v = container[i % size];
        container[i % size] = true;
    }
}

// Measure the time in milliseconds required to call f.
double measure(std::function<void (void)> f)
{
    clock_t start = std::clock();
    f();
    return 1000.0 * (std::clock() - start)/CLOCKS_PER_SEC;
}

int main (void)
{
    // Benchmark with 100K elements.
    std::bitset<100000> bitset100K;
    std::vector<bool> vector100K(100000);
    std::cout << "bitset<100000> (3 billion accesses/inserts): ";
    std::cout << measure([&]() { access_and_insert(bitset100K, 3E7); }) << " ms " << std::endl;
    std::cout << "vector<bool>(100000) (3 billion accesses/inserts): ";
    std::cout << measure([&]() { access_and_insert(vector100K, 3E7); }) << " ms" << std::endl;
    std::cout << std::endl;

    // Benchmark with 1M elements.
    std::bitset<1000000> bitset1M;
    std::vector<bool> vector1M(1000000);
    std::cout << "bitset<1000000> (3 billion accesses/inserts): ";
    std::cout << measure([&]() { access_and_insert(bitset1M, 3E7); }) << " ms " << std::endl;
    std::cout << "vector<bool>(1000000) (3 billion accesses/inserts): ";
    std::cout << measure([&]() { access_and_insert(vector1M, 3E7); }) << " ms" << std::endl;
    std::cout << std::endl;

    // Benchmark with 5M elements.
    std::bitset<5000000> bitset5M;
    std::vector<bool> vector5M(5000000);
    std::cout << "bitset<5000000> (3 billion accesses/inserts): ";
    std::cout << measure([&]() { access_and_insert(bitset5M, 3E7); }) << " ms " << std::endl;
    std::cout << "vector<bool>(5000000) (3 billion accesses/inserts): ";
    std::cout << measure([&]() { access_and_insert(vector5M, 3E7); }) << " ms" << std::endl;

    return 0;
}

the vector accesses its elements with iterators, which can't be a simple typedef for bool*, , which makes it slower than bitset, which doesn't provide iterators. 向量使用迭代器访问其元素,而迭代器对于bool *来说不是简单的typedef,这使其比不提供迭代器的bitset慢。 Another thing that makes it fast is that its size is known compile-time and therefore it does no allocation with new, which is slower than stack allocation. 使其快速的另一件事是它的大小在编译时是已知的,因此它不使用new进行分配,这比堆栈分配要慢。 Just random thoughts 只是随意的想法

另外,请注意vector<bool>是矢量模板的一种特殊形式,其实现方式与您可能想到的完全不同。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM