IPC bottleneck?

I have two processes, a producer and a consumer. IPC is done with OpenFileMapping/MapViewOfFile on Win32.

The producer receives video from another source, which it then passes over to the consumer; synchronization is done through two events.

For the producer:

Receive frame
Copy to shared memory using CopyMemory
Trigger DataProduced event
Wait for DataConsumed event

For the consumer:

Indefinitely wait for DataProducedEvent
Copy frame to own memory and send for processing
Signal DataConsumed event

Without any of this, the video averages at 5fps. If I add the events on both sides, but without the CopyMemory, it's still around 5fps though a tiny bit slower. When I add the CopyMemory operation, it goes down to 2.5-2.8fps. Memcpy is even slower.

I find it hard to believe that a simple memory copy can cause this kind of slowdown. Any ideas on a remedy?

Here's my code to create the shared mem:

HANDLE fileMap = CreateFileMapping(INVALID_HANDLE_VALUE, 0, PAGE_READWRITE, 0, fileMapSize, L"foomap");
void* mapView = MapViewOfFile(fileMap, FILE_MAP_WRITE | FILE_MAP_READ, 0, 0, fileMapSize);

The size is 1024 * 1024 * 3.
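For reference, the consumer side would presumably open the same mapping by name; a minimal sketch, assuming the same fileMapSize:

HANDLE fileMap = OpenFileMapping(FILE_MAP_READ, FALSE, L"foomap");
void* mapView = MapViewOfFile(fileMap, FILE_MAP_READ, 0, 0, fileMapSize);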

Edit - added the actual code:

On the producer:

void OnFrameReceived(...)
{
    // get buffer
    BYTE *buffer = 0;
...

    // copy data to shared memory
    CopyMemory(((BYTE*)mapView) + 1, buffer, length);

    // signal data event
    SetEvent(dataProducedEvent);

    // wait for it to be signaled back!
    WaitForSingleObject(dataConsumedEvent, INFINITE);
}

On the consumer:

while(WAIT_OBJECT_0 == WaitForSingleObject(dataProducedEvent, INFINITE))
{
    SetEvent(dataConsumedEvent);
}

Well, it seems that copying from the DirectShow buffer onto shared memory was the bottleneck after all. I tried using a named pipe to transfer the data over and guess what - the performance is restored.

Does anyone know of any reasons why this may be?

To add a detail that I didn't think was relevant before: the producer is injected and hooks onto a DirectShow graph to retrieve the frames.
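For reference, a minimal sketch of what such a named-pipe transfer could look like (the pipe name and the frameBuffer/frameSize variables are assumptions, not the actual code):

// Producer side - create the pipe and push each frame through it:
HANDLE outPipe = CreateNamedPipeW(L"\\\\.\\pipe\\frames",
    PIPE_ACCESS_OUTBOUND, PIPE_TYPE_BYTE | PIPE_WAIT,
    1, fileMapSize, fileMapSize, 0, NULL);
ConnectNamedPipe(outPipe, NULL);                   // blocks until the consumer connects

DWORD written = 0;
WriteFile(outPipe, buffer, length, &written, NULL);  // per frame

// Consumer side - open the pipe and read frames from it:
HANDLE inPipe = CreateFileW(L"\\\\.\\pipe\\frames", GENERIC_READ,
    0, NULL, OPEN_EXISTING, 0, NULL);
DWORD bytesRead = 0;
ReadFile(inPipe, frameBuffer, frameSize, &bytesRead, NULL);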

Copying of memory involves certain operations under the hood, and for video this can be significant.

I'd try another route: create a shared block for each frame or several frames. Name them consecutively, i.e. block1, block2, block3, etc., so that the recipient knows which block to read next. Now receive the frame directly into the allocated blockX, notify the consumer about the availability of the new block, and allocate and start using another block immediately. The consumer maps the block and doesn't copy it - the block belongs to the consumer now, and the consumer can use the original buffer in further processing. Once the consumer closes its mapping of the block, the mapping is destroyed. This way you get a stream of blocks and avoid blocking.

If frame processing doesn't take much time and creation of a shared block does, you can create a pool of shared blocks, large enough to ensure that the producer and consumer never attempt to use the same block (you can complicate the scheme by using a semaphore or mutex to guard each block).

Hope my idea is clear - avoid copying by using the same block in the producer, then in the consumer.
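A minimal sketch of this per-frame block scheme (FRAME_SIZE, the frameNumber counter, and the L"block%d" naming convention are placeholders):

// Producer - receive each frame directly into a freshly created named block:
wchar_t name[32];
swprintf(name, 32, L"block%d", frameNumber);
HANDLE block = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                  0, FRAME_SIZE, name);
BYTE *dest = (BYTE*)MapViewOfFile(block, FILE_MAP_WRITE, 0, 0, FRAME_SIZE);
// ... receive the frame into dest, then notify the consumer of frameNumber ...

// Consumer - map the block by name and process it in place, no copy:
HANDLE block = OpenFileMappingW(FILE_MAP_READ, FALSE, name);
BYTE *frame = (BYTE*)MapViewOfFile(block, FILE_MAP_READ, 0, 0, FRAME_SIZE);
// ... process frame ...
UnmapViewOfFile(frame);   // closing the last mapping destroys the block
CloseHandle(block);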

The time it takes to copy 3MB of memory really shouldn't be at all noticeable. A quick test on my old (and busted) laptop was able to complete 10,000 memcpy(buf1, buf2, 1024 * 1024 * 3) operations in around 10 seconds. At 1/1000th of a second per copy, it shouldn't be slowing down your frame rate by a noticeable amount.
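A rough sketch of that kind of timing test (not the exact benchmark, just an illustration of the arithmetic - 10,000 copies in ~10s works out to ~1ms each):

#include <windows.h>
#include <cstring>
#include <cstdio>
#include <cstdlib>

int main()
{
    const size_t SIZE = 1024 * 1024 * 3;
    char *buf1 = (char*)malloc(SIZE);
    char *buf2 = (char*)malloc(SIZE);
    memset(buf1, 0, SIZE);      // touch the pages so they are committed up front
    memset(buf2, 0xAB, SIZE);

    DWORD start = GetTickCount();
    for (int i = 0; i < 10000; ++i)
        memcpy(buf1, buf2, SIZE);
    DWORD elapsed = GetTickCount() - start;

    // use buf1 in the output so the copies can't be optimised away
    printf("%lu ms total, %.3f ms per copy (check byte: %d)\n",
           elapsed, elapsed / 10000.0, buf1[0]);
    free(buf1);
    free(buf2);
    return 0;
}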

Regardless, it would seem that there is probably some optimisation that could occur to speed things up. Currently you seem to be either double- or triple-handling the data. Double handling because you "receive the frame" then "copy to shared memory". Triple handling if "Copy frame to own memory and send for processing" means that you truly copy to a local buffer and then process, instead of just processing from the buffer.

The alternative is to receive the frame into the shared buffer directly and process it directly out of the buffer. If, as I suspect, you want to be able to receive one frame while processing another, you just increase the size of the memory mapping to accommodate more than one frame and use it as a circular array. On the consumer side it would look something like this.

char *sharedMemory;
int frameNumber = 0;
...
WaitForSingleObject(...);  // Consume data produced event
char *frame = &sharedMemory[FRAME_SIZE * (frameNumber % FRAMES_IN_ARRAY_COUNT)];
processFrame(frame);
frameNumber++;
ReleaseSemaphore(...);     // Generate data consumed event

And the producer:

char *sharedMemory;
int frameNumber = 0;
...
WaitForSingleObject(...);  // Consume data consumed event
char *frame = &sharedMemory[FRAME_SIZE * (frameNumber % FRAMES_IN_ARRAY_COUNT)];
receiveFrame(frame);
frameNumber++;
ReleaseSemaphore(...);     // Generate data produced event

Just make sure that the data-consumed semaphore is initialised to FRAMES_IN_ARRAY_COUNT and the data-produced semaphore is initialised to 0.
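A minimal sketch of how those two semaphores might be created (the pool size and the semaphore names are assumptions):

const LONG FRAMES_IN_ARRAY_COUNT = 4;   // assumed pool size

// Starts full: the producer may fill every slot before it has to wait.
HANDLE dataConsumedSem = CreateSemaphoreW(NULL, FRAMES_IN_ARRAY_COUNT,
                                          FRAMES_IN_ARRAY_COUNT, L"dataConsumed");
// Starts empty: the consumer waits until a frame has been produced.
HANDLE dataProducedSem = CreateSemaphoreW(NULL, 0,
                                          FRAMES_IN_ARRAY_COUNT, L"dataProduced");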
