
Implementing concurrent_vector according to Intel blog

I am trying to implement a thread-safe, lockless container analogous to std::vector, following this Intel blog post: https://software.intel.com/en-us/blogs/2008/07/24/tbbconcurrent_vector-secrets-of-memory-organization

From what I understand, to prevent re-allocations (which would invalidate all iterators on all threads), instead of using a single contiguous array they add new contiguous blocks.
Each block they add has a size that is the next increasing power of 2, so they can use log2(index) to find the segment where the item at [index] is supposed to be.

From what I gather, they have a static array of pointers to segments so they can access them quickly. However, they don't know in advance how many segments the user will need, so they start with a small array, and if the number of segments exceeds its capacity, they allocate a huge one and switch to using that instead.

The problem is that adding a new segment can't be done in a lockless, thread-safe manner, or at least I haven't figured out how. I can atomically increment the current size, but only that.
Switching from the small to the large array of segment pointers also involves a big allocation and memory copies, so I can't understand how they are doing it.

They have some code posted online, but all the important functions lack available source code; they are in their Threading Building Blocks DLL. Here is some code that demonstrates the issue:

#include <windows.h> // for InterlockedIncrement

template<typename T>
class concurrent_vector
{
    private:
        volatile LONG size = 0; // InterlockedIncrement requires a LONG
        int lastSegmentIndex = 0;

        union
        {
            T* segmentsSmall[3];
            T** segmentsLarge;
        };

        void switch_to_large()
        {
            //Bunch of allocations, creates a T* segmentsLarge[32] basically and reassigns all old entries into it
        }

    public:
        concurrent_vector()
        {
            //The initial array is contiguous just for the sake of cache optimization
            T* initialContiguousBlock = new T[2 + 4 + 8]; //2^1 + 2^2 + 2^3
            segmentsSmall[0] = initialContiguousBlock;
            segmentsSmall[1] = initialContiguousBlock + 2;
            segmentsSmall[2] = initialContiguousBlock + 2 + 4;
        }

        void push_back(T& item)
        {
            if(size >= 2 + 4 + 8) //the small segments hold 14 elements in total
            {
                switch_to_large(); //This is the problem part, there is no possible way to make this thread-safe without a mutex lock. I don't understand how Intel does it. It includes a bunch of allocations and memory copies.
            }

            InterlockedIncrement(&size); //Ok, so size is atomically increased

            //afterwards adds the item to the appropriate slot in the appropriate segment
        }
};

I would not try to make segmentsLarge and segmentsSmall a union. Yes, this wastes one more pointer, but in exchange a single pointer, let's call it just segments, can initially point to segmentsSmall.

All the other methods can then always use that same pointer, which makes them simpler.

Switching from small to large can then be accomplished by one compare-exchange of that pointer.

I am not sure how this could be accomplished safely with a union.

The idea would look something like this (note that I used C++11, which the Intel library predates, so they likely did it with their own atomic intrinsics). This probably misses quite a few details that I am sure the Intel people have thought through, so you will likely have to check it against the implementations of all the other methods.

#include <algorithm> // for std::copy
#include <atomic>
#include <array>
#include <cstddef>
#include <climits>

template<typename T>
class concurrent_vector
{
private:
  // one segment-pointer slot per possible power-of-two segment
  static constexpr unsigned numSegments = sizeof(size_t) * CHAR_BIT;

  std::atomic<size_t> size;
  std::array<T*, 3> segmentsSmall; // declared before 'segments' so that
                                   // its data() is usable in the initializer list
  std::atomic<T**> segments;
  unsigned lastSegmentIndex = 0;

  void switch_to_large()
  {
    T** segmentsOld = segments;
    if( segmentsOld == segmentsSmall.data()) {
      // not yet switched
      T** segmentsLarge = new T*[sizeof(size_t) * CHAR_BIT];
      // note that we leave the original segment allocations alone and just copy the pointers
      std::copy(segmentsSmall.begin(), segmentsSmall.end(), segmentsLarge);
      for(unsigned i = segmentsSmall.size(); i < numSegments; ++i) {
        segmentsLarge[i] = nullptr;
      }
      // now both the old and the new segments array are valid
      if( segments.compare_exchange_strong(segmentsOld, segmentsLarge)) {
        // success!
        return;
      }  else {
        // already switched, just clean up
        delete[] segmentsLarge;
      }
    }
  }

public:
  concurrent_vector()  : size(0), segments(segmentsSmall.data())
  {
    //The initial array is contiguous just for the sake of cache optimization
    T* initialContiguousBlock = new T[2 + 4 + 8]; //2^1 + 2^2 + 2^3
    segmentsSmall[0] = initialContiguousBlock;
    segmentsSmall[1] = initialContiguousBlock + 2;
    segmentsSmall[2] = initialContiguousBlock + 2 + 4;
  }

  void push_back(T& item)
  {
    if(size >= 2 + 4 + 8) {
      switch_to_large();
    }
    // here we may have to allocate more segments atomically
    ++size;

    //afterwards adds the item to the appropriate slot in the appropriate segment
  }
};
