
How to implement zero-copy TCP using a lock-free circular buffer in C++

I have multiple threads that need to consume data from a TCP stream. I wish to use a circular buffer/queue in shared memory to read from the TCP socket. The TCP receive will write directly to the circular queue. The consumers will read from the queue.

This design should enable zero-copy and zero-lock. However, there are two different issues here.

  1. Is it possible/efficient to read just one logical message from the TCP socket? If not, and I read more than one message, I will have to copy the residual bytes from this to this->next.

  2. Is it really possible to implement a lock-less queue? I know there are atomic operations, but these can be costly too, because all CPU caches need to be invalidated. This will affect all operations on all of my 24 cores.

I'm a little rusty in low-level TCP, and not exactly clear how to tell when a message is complete. Do I look for \0, or is it implementation-specific?

ty

Unfortunately, TCP cannot transfer messages, only byte streams. If you want to transfer messages, you will have to apply a protocol on top. The best protocols for high performance are those that use a sanity-checkable header specifying the message length - this allows you to read the correct amount of data directly into a suitable buffer object without iterating the data byte-by-byte looking for an end-of-message character. The buffer POINTER can then be queued off to another thread and a new buffer object created/depooled for the next message. This avoids any copying of bulk data and, for large messages, is sufficiently efficient that using a non-blocking queue for the message object pointers is somewhat pointless.

The next optimization available is to pool the object *buffers to avoid continual new/dispose, recycling the *buffers in the consumer thread for re-use in the network receiving thread. This is fairly easy to do with a ConcurrentQueue, preferably blocking to allow flow-control instead of data corruption or segfaults/AV if the pool empties temporarily.
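A minimal blocking pool along those lines might look like this (names are illustrative; in real code the queue would hold the large pooled buffer objects):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Minimal blocking pool/queue sketch. pop() blocks while the pool is empty,
// which is the flow-control behaviour described above: a receiver that
// outruns its consumers stalls instead of corrupting data or dereferencing
// a buffer that is not there.
template <typename T>
class BlockingPool {
public:
    void push(T* buf) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(buf);
        }
        cv_.notify_one();
    }
    T* pop() {  // blocks until a buffer is available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T* buf = q_.front();
        q_.pop();
        return buf;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T*> q_;
};
```

The network thread pop()s an empty buffer before each receive; whichever stage is last to touch a filled buffer push()es it back for re-use.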

Next, add a [cacheline size] 'dead-zone' at the start of each *buffer data member, so preventing any thread from false-sharing data with any other.

The result should be a high-bandwidth flow of complete messages into the consumer thread with very little latency, CPU waste or cache-thrashing. All your 24 cores can run flat-out on different data.

Copying bulk data in multithreaded apps is an admission of poor design and defeat.

Follow up..

Sounds like you're stuck with iterating the data because of the different protocols :(

False-sharing-free PDU buffer object, example:

typedef struct{
  char deadZone[256];   // anti-false-sharing padding (a cacheline or more)
  int dataLen;
  char data[8388608];   // 8 MB of data
} SbufferData;

class TdataBuffer {
private:
  TbufferPool *myPool;  // reference to the pool used, in case there is more than one
  EpduState PDUstate;   // enum state variable used to decode the protocol
protected:
  SbufferData netData;
public:
  virtual void reInit();                        // zeros dataLen, resets PDUstate etc. - call when depooling a buffer
  virtual int loadPDU(char *fromHere, int len); // loads a protocol unit
  void release();                               // pushes 'this' back onto 'myPool'
};

loadPDU gets passed a pointer to, and the length of, the raw network data. It returns either 0 - meaning it has not yet completely assembled a PDU - or the number of bytes it ate from the raw network data to completely assemble a PDU, in which case: queue it off, depool another one and call loadPDU() with the unused remainder of the raw data, then continue with the next raw data to come in.

You can use different pools of different derived buffer classes to serve different protocols, if needed - an array of TbufferPool[Eprotocols]. TbufferPool could just be a BlockingCollection queue. Management becomes almost trivial - the buffers can be sent on queues all round your system, to a GUI to display stats, then perhaps to a logger, as long as, at the end of the chain of queues, something calls release().

Obviously, a 'real' PDU object would have loads more methods, data unions/structs, maybe iterators and a state-engine to operate the protocol, but that's the basic idea anyway. The main thing is easy management, encapsulation and, since no two threads can ever operate on the same buffer instance, no lock/synchro is required to parse/access the data.

Oh, yes, and since no queue has to remain locked for longer than required to push/pop one pointer, the chances of actual contention are very low - even conventional blocking queues would hardly ever need to use kernel locking.

If you are using Windows 8 or Windows Server 2012, Registered I/O can be used, which offers higher bandwidth for lower CPU than regular IOCP; it does this by cutting out kernel transitions and copies, amongst other things.

API: http://msdn.microsoft.com/en-us/library/windows/desktop/ms740642%28v=vs.85%29.aspx

Background info: http://www.serverframework.com/asynchronousevents/rio/
