
Boost: multithread performance, reuse of threads/sockets

I'll first describe my task and then present my questions below.

I am trying to implement the "one thread one connection" scheme for our distributed DAQ system. I have used Boost for threads (thread_group) and ASIO for sockets, on a Linux platform.

We have 320 networked DAQ modules. Approximately once every 0.25 ms, about half of them will each generate a packet of data (size smaller than the standard MTU) and send it to a Linux server. Each of the modules has its own long-lived TCP connection to its dedicated port on the server. That is, the server-side application runs 320 threads with 320 synchronous TCP receivers, on a 1 GbE NIC and 8 CPU cores.

The 320 threads do not have to do any computing on the incoming data; they only receive data, generate and add a timestamp, and store the data in thread-owned memory. The sockets are all synchronous, so threads that have no incoming data are blocked. Sockets are kept open for the duration of a run.

Our requirement is that the threads should read their individual socket connections with as little time lag as possible. Having read about the C10K problem and this post, I expected that each thread would easily process the equivalent of at least 1K MTU-size packets every second.

My problem is this: I first tested the system by firing time-synchronized data at the server (incoming data on different sockets are less than a few microseconds apart). When the number of data packets is very small (fewer than 10), I find that the thread timestamps are separated by a few microseconds. However, with more than 10, the timestamps are spread by as much as 0.7 s.

My questions are:

  1. Have I totally misunderstood the C10K issue and messed up the implementation? 320 connections does seem trivial compared to C10K.
  2. Any hints as to what's going wrong?
  3. Could this really be a case for reuse of threads and/or sockets? (I really don't know how to implement reuse in my case, so any explanation is appreciated.)

320 threads is chump change in terms of resources, but the scheduling may pose issues.

320 * 0.25 = 80 requests per second, implying at least 80 context switches per second, because you decided you must have each connection on a thread.

I'd simply suggest: don't do this. It's well known that thread-per-connection doesn't scale. And it almost always implies further locking contention on any shared resources (assuming that all the responses aren't completely stateless).


Q. Having read about the C10K and this post I expected that each thread will easily process the equivalent of at least 1K of MTU size packets every second

Yes. A single thread can easily sustain that (on most systems). But that is no longer true, obviously, if you have hundreds of threads trying to do the same, competing for a physical core.

So for maximum throughput and low latency, it's hardly ever useful to have more threads than there are available (!) physical cores.
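That rule of thumb can be applied at runtime; a small helper (ours, not from the answer) that sizes a worker pool from the hardware rather than from the connection count:

```cpp
#include <algorithm>
#include <thread>

// Pool-sizing helper: never more worker threads than the hardware can run
// concurrently, with a floor of 1 because hardware_concurrency() is allowed
// to return 0 when the count is not computable.
unsigned pool_size()
{
    return std::max(1u, std::thread::hardware_concurrency());
}
```

Note that `hardware_concurrency()` reports logical hardware threads, so on a hyper-threaded 8-core box you may want to halve it to match physical cores.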


Q. Could this really be a case for reuse of threads and/or sockets? (I really don't know how to implement reuse in my case, so any explanation is appreciated.)

The good news is that Boost Asio makes it very easy to use a single thread (or a limited pool of threads) to service the asynchronous tasks from its service queue.

That is, assuming you are already using the *_async versions of the ASIO API functions.

I think the vast majority, if not all, of the Boost Asio examples of asynchronous IO show how to run the service on a limited number of threads only.

