
Cross Process Memory Barrier

I'm using memory mapped files for cross process data sharing.

I have two processes, one that writes data chunks and one or several others that read these chunks. In order for the readers to know whether a chunk is ready I'm writing two "tag" values, one at the start and one at the end of each chunk to signal that it is ready.

It looks something like this:

NOTE: In this example I don't include the fact that the reader processes can seek back to previous chunks.

static const int32_t START_TAG = 0xFAFAFAFA;
static const int32_t END_TAG = 0x06060606;

void writer_process(int32_t* memory_mapped_file_ptr)
{
    auto ptr = memory_mapped_file_ptr;
    while (true)
    {
        std::vector<int32_t> chunk = generate_chunk();
        // copy the data first, two ints past the start (room for tag + length)
        std::copy(chunk.begin(), chunk.end(), ptr + 2);

        // We are done writing. Write the tags.

        *ptr = START_TAG;
        ptr += 1;
        *ptr = chunk.size();
        ptr += 1 + chunk.size();
        *ptr = END_TAG;
        ptr += 1;
    }   
}

void reader_process(int32_t* memory_mapped_file_ptr)
{
    auto ptr = memory_mapped_file_ptr;
    while (true)
    {
        auto ptr2 = ptr;

        std::this_thread::sleep_for(std::chrono::milliseconds(20));

        if (*ptr2 != START_TAG)
            continue;

        ptr2 += 1;

        auto len = *ptr2;
        ptr2 += 1;

        if (*(ptr2 + len) != END_TAG)
            continue;

        std::vector<int32_t> chunk(ptr2, ptr2 + len);

        process_chunk(chunk);

        // advance past this chunk's END_TAG so we wait for the next one
        ptr = ptr2 + len + 1;
    }
}

This kind of works so far, but it looks to me like a very bad idea that could lead to all kinds of weird bugs due to caching and reordering behaviour.

Is there a better way to achieve this?

I've looked at:

  • message queues: inefficient, and they only work with a single reader. Also, I cannot seek back to previous chunks.

  • mutexes: I'm not sure how to lock only the current chunk instead of the entire memory. I can't have a mutex for every possible chunk (especially as they have dynamic sizes). I've considered partitioning the memory into blocks with one mutex each, but the delay that incurs between writing and reading won't work for me.

As mentioned by others, you need to have some kind of memory barrier to make sure that things are properly synchronized between multiple processors (and processes).

I would suggest you change your scheme to use a header that defines the set of currently available entries, and to use InterlockedIncrement() whenever a new entry becomes available (see the sketch after the link below).

http://msdn.microsoft.com/en-us/library/windows/desktop/ms683614%28v=vs.85%29.aspx
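
To illustrate the idea, here is a minimal sketch (my own; names such as shared_header_t and m_available are assumptions, not part of the original suggestion): the writer fully fills in an entry first, and only then publishes it by atomically incrementing a shared counter that the readers poll.

#include <windows.h>

// Hypothetical layout: a counter of published entries at the start of the
// mapping, followed by the entries themselves.
struct shared_header_t
{
    volatile LONG   m_available;    // number of entries fully written so far
};

// Writer side: write the entry's payload completely, then publish it.
// InterlockedIncrement() is a full memory barrier, so a reader that sees
// the new count also sees the entry's data.
void publish_entry(shared_header_t *header /*, entry payload... */)
{
    // ... copy the entry's data into the mapped region here ...
    InterlockedIncrement(&header->m_available);
}

// Reader side: any entry with an index below the published count is safe
// to read. InterlockedCompareExchange() with (0, 0) is an atomic read.
LONG available_entries(shared_header_t *header)
{
    return InterlockedCompareExchange(&header->m_available, 0, 0);
}

The layout described next refines this idea into one entry per buffer.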

The structure I would suggest is something like this so you can actually achieve what you want, and do it quickly:

// at the very start, the number of buffers you might have total
uint32_t   m_size;    // if you know the max. number maybe use a const instead...

// then m_size structures, one per buffer:
uint32_t   m_offset0;  // offset to your data
uint32_t   m_size0;    // size of that buffer
uint32_t   m_busy0;    // whether someone is working on the buffer
uint32_t   m_offset1;
uint32_t   m_size1;
uint32_t   m_busy1;
...
uint32_t   m_offsetN;
uint32_t   m_sizeN;
uint32_t   m_busyN;

With the offset and size you gain direct access to any buffer in your mapped area. To allocate a buffer, you probably want to implement something similar to what malloc() does, although all the necessary information is found in this table right here, so there is no need for chained lists, etc. However, if you want to free some buffers, you'll need to keep track of their sizes. And if you allocate/free all the time, you'll have fun with fragmentation. Anyway...
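
As a rough illustration of that malloc()-like approach (my own sketch, not part of the original scheme; the later snippets call a parameterless malloc_buffer(), while here the table and sizes are passed explicitly to keep it self-contained), a naive first-fit scan over the table could look like this:

#include <cstdint>

// Mirrors one row of the table above; an offset of 0 means "slot not allocated".
struct entry_t
{
    uint32_t    m_offset;   // offset of the buffer within the data area
    uint32_t    m_size;     // size of that buffer
    uint32_t    m_busy;     // lock counter (see below)
};

// Naive first-fit allocation: try a candidate offset, bump it past any
// allocated buffer it overlaps, and repeat until no overlap remains.
// Note: the scan itself is not synchronized; that is fine as long as only
// the single writer ever allocates.
uint32_t malloc_buffer(entry_t *table, uint32_t count,
                       uint32_t data_area_size, uint32_t wanted_size)
{
    uint32_t offset = sizeof(uint32_t);     // keep 0 reserved to mean "unallocated"
    bool retry = true;
    while(retry)
    {
        retry = false;
        for(uint32_t n = 0; n < count; ++n)
        {
            uint32_t o = table[n].m_offset;
            uint32_t s = table[n].m_size;
            if(o != 0 && offset < o + s && o < offset + wanted_size)
            {
                offset = o + s;             // collision, skip past that buffer
                retry = true;
            }
        }
    }
    return offset + wanted_size <= data_area_size ? offset : 0;     // 0 == out of room
}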

Another way is to make use of a ring buffer (a "pipe" in essence), so you always allocate after the last buffer, and if there is not enough room there, you allocate at the very start, closing as many old buffers as required by the new buffer's size... This would probably be easier to implement. However, it means you probably need to know where to start when looking for a buffer (i.e. keep an index of what is currently considered the "first" [oldest] buffer, which also happens to be the next one to be reused.)
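
A minimal sketch of that ring-buffer variant (again my own illustration; ring_state_t, m_head and m_first are assumed names): keep a write cursor into the data area, wrap to the start when the new buffer does not fit, and retire the oldest entries that the new allocation would overwrite.

#include <cstdint>

// Assumed bookkeeping for the ring: m_head is the write cursor into the
// data area, m_first is the index of the oldest entry still published.
struct ring_state_t
{
    uint32_t    m_head;     // next free byte in the data area
    uint32_t    m_first;    // index of the oldest ("first") buffer
};

// Returns the offset where the writer should copy wanted_size bytes,
// or 0 if the request can never fit.
uint32_t ring_alloc(ring_state_t &state, uint32_t data_area_size,
                    uint32_t wanted_size)
{
    if(wanted_size > data_area_size - sizeof(uint32_t))
    {
        return 0;
    }
    if(state.m_head == 0)
    {
        state.m_head = sizeof(uint32_t);    // first use: skip offset 0 ("unallocated")
    }
    if(state.m_head + wanted_size > data_area_size)
    {
        state.m_head = sizeof(uint32_t);    // no room at the end: wrap around
    }
    uint32_t offset = state.m_head;
    state.m_head += wanted_size;
    // Before reusing [offset, offset + wanted_size), the writer would also
    // "close" (un-publish) every entry whose buffer overlaps that range and
    // advance m_first past them.
    return offset;
}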

Since you do not explain how a buffer becomes "old" and reusable (freed so it can be reused), I cannot give you an exact implementation, but something like the following would probably do it for you.

In the header structure, if m_offset is zero, then the buffer is not currently allocated and thus there is nothing to do with that entry. If m_busy is zero, no process is accessing that buffer. I also present an m_free field which can be 0 or 1. The writer would set that flag to 1 whenever it needs more buffers to save the data it just received. I don't go too deep into that one because, again, I do not know exactly how you free your buffers. It is also not required if you never free the buffers.

0) Structures

// only if the size varies between runs, otherwise use a constant like:
// namespace { uint32_t const COUNT = 123; }
struct header_count_t
{
    uint32_t    m_size;
};

struct header_t
{
    uint32_t    m_offset;
    uint32_t    m_size;
    LONG        m_busy;   // LONG (rather than uint32_t) so it can be used with the Interlocked...() functions
    LONG        m_free;   // 1 when the buffer is about to be freed (see 3 below)
};

// and from your "ptr" you'd do:
header_count_t *header_count = (header_count_t *) ptr;
header_count->m_size = ...; // your dynamic size (if dynamic it needs to be)
header_t *header = (header_t *) (header_count + 1);
// first buffer will be at: data = (char *) (header + header_count->m_size)
for(size_t n(0); n < header_count->m_size; ++n)
{
   // do work (see below) on header[n]
   ...
}

1) To access the data, the writer must first lock the buffer; if it is not available, it tries again with the next one. Locking is done with InterlockedIncrement() and unlocking with InterlockedDecrement():

InterlockedIncrement(&header[n].m_busy);
if(header[n].m_offset == 0)
{
     // buffer not allocated yet, allocate now and copy data,
     // but do not save the offset until "much" later
     uint32_t offset = malloc_buffer();
     memcpy(ptr + offset, source_data, size);
     header[n].m_size = size;

     // extra memory barrier to make sure that the data copied
     // in the buffer is all there before we save the offset
     InterlockedIncrement(&header[n].m_busy);
     header[n].m_offset = offset;
     InterlockedDecrement(&header[n].m_busy);
}
InterlockedDecrement(&header[n].m_busy);

Now this won't be enough if you want to be able to free a buffer. In that case, another flag is necessary to prevent other processes from reusing an old buffer. Again that will depend on your implementation... (see example below.)

2) To access the data, a reader must first lock the buffer with InterlockedIncrement(); once done with the buffer, it releases it with InterlockedDecrement(). Note that the lock applies even when m_offset is zero.

InterlockedIncrement(&header[n].m_busy);
if(header[n].m_offset != 0)
{
    // do something with the buffer
    uint32_t size(header[n].m_size);
    char const *buffer_ptr = (char const *) ptr + header[n].m_offset;
    ...
}
InterlockedDecrement(&header[n].m_busy);

So here I just test whether m_offset is set.

3) If you want to be able to free a buffer, you also need to test another flag (m_free, see below). When that flag is set, the buffer is about to be freed (as soon as all processes have released it), and the flag can be used in the previous code snippets: a slot is reusable when either m_offset is zero, or the flag is 1 and the m_busy counter is exactly 1.

Something like this for the writer:

LONG lock = InterlockedIncrement(&header[n].m_busy);
if(header[n].m_offset == 0
|| (lock == 1 && header[n].m_free == 1))
{
    // new buffer (offset still zero) or reusing an old buffer

    // reset the offset first
    InterlockedIncrement(&header[n].m_busy);
    header[n].m_offset = 0;
    InterlockedDecrement(&header[n].m_busy);
    // then clear m_free
    header[n].m_free = 0;
    InterlockedIncrement(&header[n].m_busy);  // WARNING: you need another Decrement against this one...

    // code as before (malloc_buffer, memcpy, save size & offset...)
    ...
}
InterlockedDecrement(&header[n].m_busy);

And in the reader the test changes with:

if(header[n].m_offset != 0 && header[n].m_free == 0)
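
For completeness, here is a hedged sketch of how a buffer could be marked reusable (my own assumption, since the question does not say when a chunk becomes stale); the writer then actually recycles the slot when it observes lock == 1 && m_free == 1, as in the snippet above:

// Mark a buffer as reusable. Any process may call this once it decides the
// chunk is stale; the Interlocked...() pair around the store acts as a full
// barrier, and the slot is only recycled later by the writer's test above.
void mark_buffer_free(header_t *header, size_t n)
{
    InterlockedIncrement(&header[n].m_busy);
    header[n].m_free = 1;
    InterlockedDecrement(&header[n].m_busy);
}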

As a side note: all the Interlocked...() functions are full memory barriers (fences) so you're all good in that regard. You have to use many of them to make sure that you get the right synching.

Note that this is untested code... but if you want to avoid inter-process semaphores (which would probably not simplify this much), that's the way to go. The 20 ms sleep in itself is not required, except to avoid one pegged CPU per reader, obviously.
