
IPC bottleneck?

I have two processes, a producer and a consumer. IPC is done with OpenFileMapping/MapViewOfFile on Win32.

The producer receives video frames from another source and passes them to the consumer; synchronization is done through two events.

For the producer:

Receive frame
Copy to shared memory using CopyMemory
Trigger DataProduced event
Wait for DataConsumed event

For the consumer:

Indefinitely wait for DataProducedEvent
Copy frame to own memory and send for processing
Signal DataConsumed event

Without any of this, the video averages at 5fps. If I add the events on both sides, but without the CopyMemory, it's still around 5fps though a tiny bit slower. When I add the CopyMemory operation, it goes down to 2.5-2.8fps. Memcpy is even slower.

I find it hard to believe that a simple memory copy can cause this kind of slowdown. Any ideas on a remedy?

Here's my code to create the shared mem:

HANDLE fileMap = CreateFileMapping(INVALID_HANDLE_VALUE, 0, PAGE_READWRITE, 0, fileMapSize, L"foomap");
void* mapView = MapViewOfFile(fileMap, FILE_MAP_WRITE | FILE_MAP_READ, 0, 0, fileMapSize);

The size is 1024 * 1024 * 3

Edit - added the actual code:

On the producer:

void OnFrameReceived(...)
{
    // get buffer
    BYTE *buffer = 0;
...

    // copy data to shared memory
    CopyMemory(((BYTE*)mapView) + 1, buffer, length);

    // signal data event
    SetEvent(dataProducedEvent);

    // wait for it to be signaled back!
    WaitForSingleObject(dataConsumedEvent, INFINITE);
}

On the consumer:

while (WAIT_OBJECT_0 == WaitForSingleObject(dataProducedEvent, INFINITE))
{
    // (copying the frame out for processing happens here)
    SetEvent(dataConsumedEvent);
}

Well, it seems that copying from the DirectShow buffer onto shared memory was the bottleneck after all. I tried using a Named Pipe to transfer the data over and guess what - the performance is restored.

Does anyone know of any reasons why this may be?

To add a detail that I didn't think was relevant before: the producer is injected and hooks onto a DirectShow graph to retrieve the frames.

Copying memory involves work under the hood (cache misses, and page faults the first time freshly mapped pages are touched), and for video-sized buffers this can be significant.

I'd try another route: create a shared block for each frame, or for several frames. Name them consecutively, i.e. block1, block2, block3, etc., so that the recipient knows which block to read next. Now receive the frame directly into the allocated blockX, notify the consumer that the new block is available, then immediately allocate and start using another block. The consumer maps the block and doesn't copy it: the block belongs to the consumer now, and the consumer can use the original buffer in further processing. Once the consumer closes its mapping of the block, that mapping is destroyed. This way you get a stream of blocks and avoid blocking.

If frame processing doesn't take much time and creation of a shared block does, you can create a pool of shared blocks, large enough to ensure that the producer and consumer never attempt to use the same block (you can complicate the scheme by using a semaphore or mutex to guard each block).

Hope my idea is clear: avoid copying by using the same block first in the producer, then in the consumer.

The time it takes to copy 3MB of memory really shouldn't be noticeable at all. A quick test on my old (and busted) laptop was able to complete 10,000 memcpy(buf1, buf2, 1024 * 1024 * 3) operations in around 10 seconds. At roughly 1/1000th of a second per copy, it shouldn't be slowing down your frame rate by a noticeable amount.

Regardless, it would seem that there is probably some optimisation that could occur to speed things up. Currently you seem to be either double- or triple-handling the data: double handling because you "receive the frame" and then "copy to shared memory"; triple handling if "copy frame to own memory and send for processing" means that you truly copy to a local buffer and then process, instead of just processing from the buffer.

The alternative is to receive the frame into the shared buffer directly and process it directly out of the buffer. If, as I suspect, you want to be able to receive one frame while processing another, you just increase the size of the memory mapping to accommodate more than one frame and use it as a circular array. On the consumer side it would look something like this.

char *sharedMemory;
int frameNumber = 0;
...
WaitForSingleObject(...);  // consume a "data produced" count
char *frame = &sharedMemory[FRAME_SIZE * (frameNumber % FRAMES_IN_ARRAY_COUNT)];
processFrame(frame);
frameNumber++;
ReleaseSemaphore(...);     // release a "data consumed" count

And the producer

char *sharedMemory;
int frameNumber = 0;
...
WaitForSingleObject(...);  // consume a "data consumed" count
char *frame = &sharedMemory[FRAME_SIZE * (frameNumber % FRAMES_IN_ARRAY_COUNT)];
receiveFrame(frame);
frameNumber++;
ReleaseSemaphore(...);     // release a "data produced" count

Just make sure that the data-consumed semaphore is initialised to FRAMES_IN_ARRAY_COUNT and the data-produced semaphore is initialised to 0.
