
Pacemaker cluster stops all resources permanently

Is it possible to configure a Pacemaker resource group so that, when any operation times out (monitor or start; stop can be ignored), the cluster manager migrates the resources to a standby node? If there is a problem on the standby node as well, it should bring the resources back to the primary node, and so on, continuing to retry for 5 hours or even indefinitely.

In a real situation where external systems are down, continually restarting is the only way to bring the service back up as soon as possible.

Long story: I'm building resource agents for OCI public and private IPs. In Oracle Cloud, assigning a floating routable IP or an internal one requires interaction with the OCI API to configure the virtual-network side. I followed the Dummy example agent code, made a few mistakes and errors, and finally got the code into production. The resource group looks as follows: floating IPs, routes, and a systemd service. I've configured migration-threshold to 5 and resource-stickiness to 100.

 Resource Group: libreswan
 ipsec_cluster_routing_no1  (ocf::heartbeat:Route): Started node1
 ipsec_cluster_public_ip    (ocf::heartbeat:oci_publicip):  Started node1
 ipsec_cluster_private_ip_no1   (ocf::heartbeat:oci_privateip): Started node1
 ipsec_cluster_private_ip_no2   (ocf::heartbeat:oci_privateip): Started node1
 ipsec_cluster_inet_ip_no1  (ocf::heartbeat:IPaddr2):   Started node1
 ipsec_cluster_inet_ip_no2  (ocf::heartbeat:IPaddr2):   Started node1
 ipsec_cluster_routing_no2  (ocf::heartbeat:Route): Started node1
 ipsec_cluster_libreswan    (systemd:ipsec):    Started node1
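For reference, the meta attributes mentioned above can be set per resource with pcs. A minimal sketch, assuming the resource names from the listing above (loop over whichever members of the group you need):

```shell
# Sketch: set migration-threshold and resource-stickiness on group members.
# Resource names are taken from the group listing; adjust to your cluster.
for res in ipsec_cluster_public_ip ipsec_cluster_private_ip_no1 \
           ipsec_cluster_private_ip_no2 ipsec_cluster_libreswan; do
  sudo pcs resource meta "$res" migration-threshold=5 resource-stickiness=100
done
```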

Recently, due to temporary unavailability of the OCI API, the cluster manager stopped the whole resource group after a 30-second timeout of the monitor() operation on one of the oci_privateip resources.

In the logs I see the retry sequence (monitor, stop, start) five times, but after that the cluster manager gives up, leaving the resources in the Stopped state. I'd like the cluster manager to keep retrying.
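Once the cluster has given up, the accumulated fail counts can be inspected and reset manually with standard pcs commands (the resource name below is one from the group listing; substitute your own):

```shell
# Show the accumulated fail count that made Pacemaker give up on the resource
sudo pcs resource failcount show ipsec_cluster_private_ip_no1

# Clear the failure history so Pacemaker attempts to start the resource again
sudo pcs resource cleanup ipsec_cluster_private_ip_no1
```

This is a manual recovery, though; the meta attributes below make the retry automatic.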

SOLVED.解决了。

  sudo pcs resource meta $res failure-timeout=120
  sudo pcs resource meta $res migration-threshold=5

makes the "failed" node ready to take back the resources after 120 seconds. Before giving up, the failed node will retry 5 times, so with a 30-second timeout it keeps retrying for 2.5 minutes.
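A quick sanity check of that timing claim: with migration-threshold=5 and a 30-second monitor timeout, the retry window per node is 5 × 30 s, and failure-timeout=120 clears the fail count 120 seconds after the last failure.

```shell
# Verify the retry-window arithmetic from the explanation above
MONITOR_TIMEOUT=30      # seconds per monitor attempt before it times out
MIGRATION_THRESHOLD=5   # failures tolerated before the node is abandoned
RETRY_WINDOW=$((MONITOR_TIMEOUT * MIGRATION_THRESHOLD))
echo "retry window: ${RETRY_WINDOW} seconds"
```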

More info: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/configuring_the_red_hat_high_availability_add-on_with_pacemaker/s1-resourceopts-haar
