
What thresholds should be set in Service Fabric Placement / Load balancing config for Cluster with large number of guest executable applications?

I am having trouble with Service Fabric trying to place too many services onto a single node too fast.

To give an example of cluster size: there are 2-4 worker node types, 3-6 worker nodes per node type, each node type may run 200 guest executable applications, and each application has at least 2 replicas. The nodes are more than capable of running the services once they are up; it is only during startup that CPU usage is too high.

The problem seems to be the thresholds or defaults for the placement and load balancing rules set in the cluster config. As examples of what I have tried: I have turned on InBuildThrottlingEnabled and set InBuildThrottlingGlobalMaxValue to 100, and I have set the Global Movement Throttle settings to various percentages of the total application count.
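For reference, a sketch of the kind of PlacementAndLoadBalancing fragment one of those attempts looked like (100 is the InBuild limit mentioned above, and 25 is the movement-throttle value currently in the config shown further down; both are illustrative, not recommendations):

       {
         "Name": "PlacementAndLoadBalancing",
         "Parameters": [
           {
             "Name": "InBuildThrottlingEnabled",
             "Value": "true"
           },
           {
             "Name": "InBuildThrottlingGlobalMaxValue",
             "Value": "100"
           },
           {
             "Name": "GlobalMovementThrottleThresholdForPlacement",
             "Value": "25"
           },
           {
             "Name": "GlobalMovementThrottleThresholdForBalancing",
             "Value": "25"
           },
           {
             "Name": "GlobalMovementThrottleThreshold",
             "Value": "25"
           }
         ]
       },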

At this point there are two distinct scenarios I am trying to solve for. In both cases, the nodes go to 100% for long enough that Service Fabric declares the node as down.

1st: Starting an entire cluster from all nodes being off without overwhelming nodes.

2nd: A single node being overwhelmed by too many services starting after a host comes back online.

Here are my current parameters on the cluster:

       "Name": "PlacementAndLoadBalancing",
       "Parameters": [
         {
           "Name": "UseMoveCostReports",
           "Value": "true"
         },
         {
           "Name": "PLBRefreshGap",
           "Value": "1"
         },
         {
           "Name": "MinPlacementInterval",
           "Value": "30.0"
         },
         {
           "Name": "MinLoadBalancingInterval",
           "Value": "30.0"
         },
         {
           "Name": "MinConstraintCheckInterval",
           "Value": "30.0"
         },
         {
           "Name": "GlobalMovementThrottleThresholdForPlacement",
           "Value": "25"
         },
         {
           "Name": "GlobalMovementThrottleThresholdForBalancing",
           "Value": "25"
         },
         {
           "Name": "GlobalMovementThrottleThreshold",
           "Value": "25"
         },
         {
           "Name": "GlobalMovementThrottleCountingInterval",
           "Value": "450"
         },
         {
           "Name": "InBuildThrottlingEnabled",
           "Value": "false"
         },
         {
           "Name": "InBuildThrottlingGlobalMaxValue",
           "Value": "100"
         }
       ]
     },

Based on discussion in the answer below, I wanted to leave a graph image: if a node goes down, the act of shuffling services onto the remaining nodes will cause a second node to go down, as noted here. The green node goes down, then the purple one goes down due to too many resources being shuffled onto it.

(Graph illustrating the above: green goes down, then purple.)

From SF's perspective, 1 & 2 are the same problem. Also, as a note, SF doesn't evict a node just because CPU consumption is high. So: "the nodes go to 100% for long enough that Service Fabric declares the node as down" needs some more explanation. The machines might be failing for other reasons, or I guess could be so loaded that the kernel-level failure detectors can't ping other machines, but that isn't very common.

For config changes: I would remove all of these and go with the defaults:

 {
   "Name": "PLBRefreshGap",
   "Value": "1"
 },
 {
   "Name": "MinPlacementInterval",
   "Value": "30.0"
 },
 {
   "Name": "MinLoadBalancingInterval",
   "Value": "30.0"
 },
 {
   "Name": "MinConstraintCheckInterval",
   "Value": "30.0"
 },

For the InBuild throttle to work, this needs to flip to true:

     {
       "Name": "InBuildThrottlingEnabled",
       "Value": "false"
     },
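In other words, a minimal sketch of the corrected fragment (the limit of 100 is the InBuildThrottlingGlobalMaxValue already present in the config above):

     {
       "Name": "InBuildThrottlingEnabled",
       "Value": "true"
     },
     {
       "Name": "InBuildThrottlingGlobalMaxValue",
       "Value": "100"
     },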

Also, since these are likely constraint violations and placement (not proactive rebalancing), we need to explicitly instruct SF to throttle those operations as well. There is config for this in SF; although it is not documented or publicly supported at this time, you can see it in the settings. By default only balancing is throttled, but you should be able to turn on throttling for all phases and set appropriate limits via something like the below.

These first two settings are also within PlacementAndLoadBalancing, like the ones above.

 {
   "Name": "ThrottlePlacementPhase",
   "Value": "true"
 },
 {
   "Name": "ThrottleConstraintCheckPhase",
   "Value": "true"
 },

The next settings, which set the limits, are in their own sections; each is a map from node type name to the limit you want to enforce for that node type.

{
"name": "MaximumInBuildReplicasPerNodeConstraintCheckThrottle",
"parameters": [
  {
      "name": "YourNodeTypeNameHere",
      "value": "100"
  },
  {
      "name": "YourOtherNodeTypeNameHere",
      "value": "100"
  }
]
},
{
"name": "MaximumInBuildReplicasPerNodePlacementThrottle",
"parameters": [
  {
      "name": "YourNodeTypeNameHere",
      "value": "100"
  },
  {
      "name": "YourOtherNodeTypeNameHere",
      "value": "100"
  }
]
},
{
"name": "MaximumInBuildReplicasPerNodeBalancingThrottle",
"parameters": [
  {
      "name": "YourNodeTypeNameHere",
      "value": "100"
  },
  {
      "name": "YourOtherNodeTypeNameHere",
      "value": "100"
  }
]
},
{
"name": "MaximumInBuildReplicasPerNode",
"parameters": [
  {
      "name": "YourNodeTypeNameHere",
      "value": "100"
  },
  {
      "name": "YourOtherNodeTypeNameHere",
      "value": "100"
  }
]
}

I would make these changes and then try again. Additional information like what is actually causing the nodes to be down (confirmed via events and SF health info) would help identify the source of the problem. It would probably also be good to verify that starting 100 instances of the apps on the node actually works and whether that's an appropriate threshold.
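Putting those suggestions together, a rough sketch of what the resulting PlacementAndLoadBalancing section might look like (PLBRefreshGap and the Min*Interval settings are dropped so the defaults apply; the per-node-type MaximumInBuildReplicasPerNode* sections above would sit alongside it as their own sections, with your real node type names and tuned limits):

     {
       "Name": "PlacementAndLoadBalancing",
       "Parameters": [
         {
           "Name": "UseMoveCostReports",
           "Value": "true"
         },
         {
           "Name": "GlobalMovementThrottleThresholdForPlacement",
           "Value": "25"
         },
         {
           "Name": "GlobalMovementThrottleThresholdForBalancing",
           "Value": "25"
         },
         {
           "Name": "GlobalMovementThrottleThreshold",
           "Value": "25"
         },
         {
           "Name": "GlobalMovementThrottleCountingInterval",
           "Value": "450"
         },
         {
           "Name": "InBuildThrottlingEnabled",
           "Value": "true"
         },
         {
           "Name": "InBuildThrottlingGlobalMaxValue",
           "Value": "100"
         },
         {
           "Name": "ThrottlePlacementPhase",
           "Value": "true"
         },
         {
           "Name": "ThrottleConstraintCheckPhase",
           "Value": "true"
         }
       ]
     },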
