
Infinispan 9.4.16, JBoss EAP 7.3 lock contention with replicated-cache, 2 nodes, threads are TIMED_WAITING (parking)

I have an application that currently relies on an Infinispan replicated cache to share a work queue across all nodes. The queue is fairly standard, with head, tail, and size pointers persisted in the Infinispan map.
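For context, here is a minimal sketch of what that initialization might look like under pessimistic locking. The key names, types, and class/method names are assumptions for illustration only, not the actual InfinispanQueue code:

import org.infinispan.AdvancedCache;
import javax.transaction.TransactionManager;

// Illustrative sketch: the queue pointers live in the replicated cache and are
// locked before being seeded, which is roughly where the stack traces below are parked.
public class QueuePointerInit {

    private static final String HEAD_KEY = "QUEUE_HEAD";  // hypothetical key names
    private static final String TAIL_KEY = "QUEUE_TAIL";
    private static final String SIZE_KEY = "QUEUE_SIZE";

    public void initialize(AdvancedCache<String, Long> cache) throws Exception {
        TransactionManager tm = cache.getTransactionManager();
        tm.begin();
        boolean ok = false;
        try {
            // With <transaction locking="PESSIMISTIC">, lock() acquires the keys
            // cluster-wide and blocks for up to the configured acquire-timeout.
            cache.lock(HEAD_KEY, TAIL_KEY, SIZE_KEY);
            if (cache.get(HEAD_KEY) == null) {  // first node to get here seeds the pointers
                cache.put(HEAD_KEY, 0L);
                cache.put(TAIL_KEY, 0L);
                cache.put(SIZE_KEY, 0L);
            }
            ok = true;
        } finally {
            if (ok) tm.commit(); else tm.rollback();
        }
    }
}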

We've upgraded from Infinispan 7.2.5 to 9.4.16 and noticed that lock performance is much worse than before. I managed to capture thread dumps from both nodes while they were trying to initialize the queue at the same time. With Infinispan 7.2.5, lock and sync performance was very good, with no issues. Now we see lock timeouts and far more failures.

Node #1 partial stack trace from thread dump 2021-04-20 13:45:13:

"default task-2" #600 prio=5 os_prio=0 tid=0x000000000c559000 nid=0x1f8a waiting on condition [0x00007f4df3f72000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000006e1f4fec0> (a java.util.concurrent.CompletableFuture$Signaller)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
    at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:105)
    at org.infinispan.interceptors.impl.SimpleAsyncInvocationStage.get(SimpleAsyncInvocationStage.java:38)
    at org.infinispan.interceptors.impl.AsyncInterceptorChainImpl.invoke(AsyncInterceptorChainImpl.java:250)
    at org.infinispan.cache.impl.CacheImpl.lock(CacheImpl.java:1077)
    at org.infinispan.cache.impl.CacheImpl.lock(CacheImpl.java:1057)
    at org.infinispan.cache.impl.AbstractDelegatingAdvancedCache.lock(AbstractDelegatingAdvancedCache.java:286)
    at org.infinispan.cache.impl.EncoderCache.lock(EncoderCache.java:318)
    at com.siperian.mrm.match.InfinispanQueue.initialize(InfinispanQueue.java:88)

Node #2 partial stack trace from thread dump 2021-04-20 13:45:04:

"default task-2" #684 prio=5 os_prio=0 tid=0x0000000011f26000 nid=0x3c60 waiting on condition [0x00007f55107e4000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x0000000746bd36d8> (a java.util.concurrent.CompletableFuture$Signaller)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
    at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:105)
    at org.infinispan.interceptors.impl.SimpleAsyncInvocationStage.get(SimpleAsyncInvocationStage.java:38)
    at org.infinispan.interceptors.impl.AsyncInterceptorChainImpl.invoke(AsyncInterceptorChainImpl.java:250)
    at org.infinispan.cache.impl.CacheImpl.lock(CacheImpl.java:1077)
    at org.infinispan.cache.impl.CacheImpl.lock(CacheImpl.java:1057)
    at org.infinispan.cache.impl.AbstractDelegatingAdvancedCache.lock(AbstractDelegatingAdvancedCache.java:286)
    at org.infinispan.cache.impl.EncoderCache.lock(EncoderCache.java:318)
    at com.siperian.mrm.match.InfinispanQueue.initialize(InfinispanQueue.java:88)

Client-side error that pops up on the console of the machine running Node #1:

2021-04-20 13:45:49,069 ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (jgroups-15,infinispan-cleanse-cluster_192.168.0.24_cmx_system105,N1618938080334-63633(machine-id=M1618938080334)) ISPN000136: Error executing command LockControlCommand on Cache 'orclmdm-MDM_SAMPLE105/FUZZY_MATCH', writing keys []: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 60 seconds for key QUEUE_TAIL_C_PARTY and requestor GlobalTx:N1618938080334-63633(machine-id=M1618938080334):429. Lock is held by GlobalTx:N1618938062946-60114(machine-id=M1618938062946):420
    at org.infinispan.util.concurrent.locks.impl.DefaultLockManager$KeyAwareExtendedLockPromise.get(DefaultLockManager.java:288)
    at org.infinispan.util.concurrent.locks.impl.DefaultLockManager$KeyAwareExtendedLockPromise.lock(DefaultLockManager.java:261)
    at org.infinispan.util.concurrent.locks.impl.DefaultLockManager$CompositeLockPromise.lock(DefaultLockManager.java:348)
    at org.infinispan.interceptors.locking.PessimisticLockingInterceptor.localLockCommandWork(PessimisticLockingInterceptor.java:208)
    at org.infinispan.interceptors.locking.PessimisticLockingInterceptor.lambda$new$0(PessimisticLockingInterceptor.java:46)
    at org.infinispan.interceptors.InvocationSuccessFunction.apply(InvocationSuccessFunction.java:25)
    at org.infinispan.interceptors.impl.QueueAsyncInvocationStage.invokeQueuedHandlers(QueueAsyncInvocationStage.java:118)
    at org.infinispan.interceptors.impl.QueueAsyncInvocationStage.accept(QueueAsyncInvocationStage.java:81)
    at org.infinispan.interceptors.impl.QueueAsyncInvocationStage.accept(QueueAsyncInvocationStage.java:30)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
    at org.infinispan.remoting.transport.AbstractRequest.complete(AbstractRequest.java:67)
    at org.infinispan.remoting.transport.impl.MultiTargetRequest.onResponse(MultiTargetRequest.java:102)
    at org.infinispan.remoting.transport.impl.RequestRepository.addResponse(RequestRepository.java:52)
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1369)
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1272)
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$300(JGroupsTransport.java:126)
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1417)
    at org.jgroups.JChannel.up(JChannel.java:816)
    at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:900)
    at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:128)
    at org.jgroups.protocols.RSVP.up(RSVP.java:163)
    at org.jgroups.protocols.FRAG2.up(FRAG2.java:177)
    at org.jgroups.protocols.FlowControl.up(FlowControl.java:339)
    at org.jgroups.protocols.FlowControl.up(FlowControl.java:339)
    at org.jgroups.protocols.pbcast.GMS.up(GMS.java:872)
    at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:240)
    at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1008)
    at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:734)
    at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:389)
    at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:590)
    at org.jgroups.protocols.BARRIER.up(BARRIER.java:171)
    at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:131)
    at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:203)
    at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:253)
    at org.jgroups.protocols.MERGE3.up(MERGE3.java:280)
    at org.jgroups.protocols.Discovery.up(Discovery.java:295)
    at org.jgroups.protocols.TP.passMessageUp(TP.java:1250)
    at org.jgroups.util.SubmitToThreadPool$SingleMessageHandler.run(SubmitToThreadPool.java:87)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Infinispan config:

<?xml version="1.0" encoding="UTF-8"?>
<infinispan
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:infinispan:config:9.4 http://www.infinispan.org/schemas/infinispan-config-9.4.xsd"
        xmlns="urn:infinispan:config:9.4">    

    <jgroups>
        <stack-file name="mdmudp" path="$cmx.home$/jgroups-udp.xml" />
        <stack-file name="mdmtcp" path="$cmx.home$/jgroups-tcp.xml" />
    </jgroups>

    <cache-container name="MDMCacheManager" statistics="true"
        shutdown-hook="DEFAULT">
        <transport stack="mdmudp" cluster="infinispan-cluster"
            node-name="$node$" machine="$machine$" />

        <jmx domain="org.infinispan.mdm.hub"/>  

        <replicated-cache name="FUZZY_MATCH" statistics="true" unreliable-return-values="false">
            <locking isolation="READ_COMMITTED" acquire-timeout="60000"
                concurrency-level="5000" striping="false" />
            <transaction
                transaction-manager-lookup="org.infinispan.transaction.lookup.GenericTransactionManagerLookup"
                stop-timeout="30000" auto-commit="true" locking="PESSIMISTIC"
                mode="NON_XA" notifications="true" />
        </replicated-cache>

    </cache-container>
</infinispan>
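For reference, a sketch of how the FUZZY_MATCH cache might be obtained from this configuration (the config file name here is an assumption):

import org.infinispan.AdvancedCache;
import org.infinispan.manager.DefaultCacheManager;

// Sketch only; the configuration file name is assumed.
public class CacheBootstrap {
    public static AdvancedCache<String, Long> fuzzyMatchCache() throws Exception {
        // Parses the XML above and joins the cluster named in the <transport> element
        // over the mdmudp stack.
        DefaultCacheManager manager = new DefaultCacheManager("infinispan-config.xml");
        // Transactional replicated cache with pessimistic locking and a 60s acquire-timeout.
        return manager.<String, Long>getCache("FUZZY_MATCH").getAdvancedCache();
    }
}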

We are using UDP multicast with the defaults; here's the UDP config:

<!--
  Default stack using IP multicasting. It is similar to the "udp"
  stack in stacks.xml, but doesn't use streaming state transfer and flushing
  author: Bela Ban
-->

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
    <UDP
         mcast_port="${jgroups.udp.mcast_port:46688}"
         ip_ttl="4"
         tos="8"
         ucast_recv_buf_size="5M"
         ucast_send_buf_size="5M"
         mcast_recv_buf_size="5M"
         mcast_send_buf_size="5M"
         max_bundle_size="64K"
         enable_diagnostics="true"
         thread_naming_pattern="cl"

         thread_pool.enabled="true"
         thread_pool.min_threads="2"
         thread_pool.max_threads="8"
         thread_pool.keep_alive_time="5000"/>

    <PING />
    <MERGE3 max_interval="30000"
            min_interval="10000"/>
    <FD_SOCK/>
    <FD_ALL/>
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK2 xmit_interval="500"
                    xmit_table_num_rows="100"
                    xmit_table_msgs_per_row="2000"
                    xmit_table_max_compaction_time="30000"
                    use_mcast_xmit="false"
                    discard_delivered_msgs="true"/>
    <UNICAST3 xmit_interval="500"
              xmit_table_num_rows="100"
              xmit_table_msgs_per_row="2000"
              xmit_table_max_compaction_time="60000"
              conn_expiry_timeout="0"/>
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="2000"
                view_bundling="true"/>
    <UFC max_credits="2M"
         min_threshold="0.4"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
    <RSVP resend_interval="2000" timeout="10000"/>
    <pbcast.STATE_TRANSFER />
    <!-- pbcast.FLUSH  /-->
</config>

Any ideas on configuration would be great. What happens is that both nodes time out and the queue doesn't initialize properly (null keys). Thanks in advance. BTW, there are up to 24 threads on each node (48 total) that may access the shared queue.

I did some research, and it turns out that locks against replicated caches are acquired on the remote nodes first, before the keys are locked locally. I believe deadlocks are possible if node1 tries to lock node2 at the same time node2 tries to lock node1. Because of this, I've changed all my caches to use Flag.FAIL_SILENTLY and Flag.ZERO_LOCK_ACQUISITION_TIMEOUT, and added retry logic on the client side when adding or removing elements from the queue. From initial testing, things look much better now.
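A rough sketch of that fail-fast-and-retry approach, assuming the lock is taken inside the caller's transaction; the attempt count and backoff are illustrative, not the actual client code:

import org.infinispan.AdvancedCache;
import org.infinispan.context.Flag;

// Sketch of the retry idea described above; values are illustrative.
public class QueueLockRetry {

    private static final int MAX_ATTEMPTS = 10;
    private static final long BACKOFF_MILLIS = 200L;

    // Must be called inside an active transaction (pessimistic locking mode).
    public boolean lockWithRetry(AdvancedCache<String, Long> cache, String key)
            throws InterruptedException {
        // ZERO_LOCK_ACQUISITION_TIMEOUT makes the attempt return immediately instead of
        // blocking for 60s; FAIL_SILENTLY turns a failed acquisition into lock()
        // returning false rather than throwing.
        AdvancedCache<String, Long> failFast =
                cache.withFlags(Flag.ZERO_LOCK_ACQUISITION_TIMEOUT, Flag.FAIL_SILENTLY);
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            if (failFast.lock(key)) {
                return true;  // lock is held until the surrounding transaction ends
            }
            Thread.sleep(BACKOFF_MILLIS);  // simple backoff before retrying
        }
        return false;
    }
}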

I'm curious what changed between Infinispan 7 and later versions to make pessimistic locking perform so much worse. The old client code (without flags or retry logic) worked perfectly under the same testing conditions. I'm suspicious of the changes related to CompletableFutures and the ForkJoinPool, as I've run into issues using those in other projects and had to go back to doing things the old way with standard Executors.
