Highly Concurrent Apache Async HTTP Client IOReactor issues
Application description:
According to the above sections, here's the tuning for my fiber HTTP client (of which, of course, I'm using a single instance):
PoolingNHttpClientConnectionManager connectionManager =
        new PoolingNHttpClientConnectionManager(
                new DefaultConnectingIOReactor(
                        IOReactorConfig.custom()
                                .setIoThreadCount(16)
                                .setSoKeepAlive(false)
                                .setSoLinger(0)
                                .setSoReuseAddress(false)
                                .setSelectInterval(10)
                                .build()));
connectionManager.setDefaultMaxPerRoute(32768);
connectionManager.setMaxTotal(131072);

FiberHttpClientBuilder fiberClientBuilder = FiberHttpClientBuilder.create()
        .setDefaultRequestConfig(
                RequestConfig.custom()
                        .setSocketTimeout(1500)
                        .setConnectTimeout(1000)
                        .build())
        .setConnectionReuseStrategy(NoConnectionReuseStrategy.INSTANCE)
        .setConnectionManager(connectionManager)
        .build();
ulimits for open files are set very high (131072 for both soft and hard values).
kernel.printk = 8 4 1 7
kernel.printk_ratelimit_burst = 10
kernel.printk_ratelimit = 5
net.ipv4.ip_local_port_range = 8192 65535
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 100000
net.ipv4.tcp_max_syn_backlog = 100000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 1
Problem description
At around 25K leased connections, actual data is no longer sent over the socket connections, and the Pending stat climbs to a sky-rocketing 30K pending connection requests as well. lsof-ing the java process, I can see it has tens of thousands of file descriptors, almost all of them in CLOSE_WAIT (which makes sense, as the I/O reactor threads die/stop functioning and never get to actually close them).

Questions
Forgot to answer this, but I figured out what was going on roughly a week after posting the question:
There was some sort of misconfiguration that caused the IO reactor to spawn with only 2 threads.
Even after providing more reactor threads, the issue persisted. It turns out that our outgoing requests were mostly SSL. Apache's SSL connection handling delegates the core work to the JVM's SSL facilities, which simply are not efficient enough to handle thousands of SSL connection requests per second. More specifically, some methods inside SSLEngine (if I recall correctly) are synchronized. Taking thread dumps under high load shows the IOReactor threads blocking each other while trying to open SSL connections.
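As a minimal illustration of the code path involved (not the author's code): each outgoing TLS connection gets its own javax.net.ssl.SSLEngine from a shared SSLContext, and it is this engine/session machinery that showed up as the contention point in the thread dumps. A sketch, using the JVM-default context and a hypothetical peer:

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class SslEngineDemo {
    public static void main(String[] args) throws Exception {
        // A shared context, much like the one the async client holds internally.
        SSLContext ctx = SSLContext.getDefault();

        // One engine is created per outgoing connection; at thousands of
        // connection attempts per second, synchronized sections inside the
        // engine/session machinery serialize the IOReactor threads.
        SSLEngine engine = ctx.createSSLEngine("example.com", 443); // hypothetical peer
        engine.setUseClientMode(true);

        System.out.println(engine.getPeerHost() + ":" + engine.getPeerPort());
    }
}
```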
Even trying to create a pressure-release valve in the form of a connection lease timeout didn't work, because the backlogs created were too large, rendering the application useless.
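For reference, the lease timeout mentioned here maps to HttpClient's connectionRequestTimeout, i.e. how long a caller may wait to lease a connection from the pool before failing. A sketch of that configuration (the 200 ms value is illustrative, not the author's setting):

```java
import org.apache.http.client.config.RequestConfig;

// Fail fast instead of queueing when the pool is exhausted.
RequestConfig config = RequestConfig.custom()
        .setConnectionRequestTimeout(200) // max wait (ms) to lease a pooled connection
        .setConnectTimeout(1000)
        .setSocketTimeout(1500)
        .build();
```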
Offloading outgoing SSL request handling to nginx performed even worse: because the remote endpoints terminate the requests preemptively, the SSL client session cache could not be used (the same goes for the JVM implementation).
Wound up putting a semaphore in front of the entire module, limiting the whole thing to ~6000 concurrent requests at any given moment, which solved the issue.
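That gate can be sketched with a plain java.util.concurrent.Semaphore; the class and method names below are illustrative, not the author's actual code, and the 6000 limit is the value quoted above:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Bounds the number of in-flight outgoing requests across the whole module.
public class OutboundRequestGate {
    private final Semaphore permits;

    public OutboundRequestGate(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Blocks until a permit is free, runs the request, then releases the permit.
    public <T> T execute(Callable<T> request) throws Exception {
        permits.acquire();
        try {
            return request.call();
        } finally {
            permits.release();
        }
    }

    public int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) throws Exception {
        OutboundRequestGate gate = new OutboundRequestGate(6000);
        String result = gate.execute(() -> "ok"); // placeholder for an HTTP call
        System.out.println(result + ", permits left: " + gate.availablePermits());
    }
}
```

Callers that cannot acquire a permit simply wait, so bursts are smoothed out before they ever reach the connection pool or the SSL layer.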