Service Fabric - Cannot do Config upgrade to add or remove nodes

I've got an on-premises Service Fabric cluster consisting of 18 nodes (9 are seed nodes), secured via gMSA Windows security. The cluster code version is 6.4.622.9590.

Unfortunately, I have to rebuild 6 of these nodes (3 of them seed nodes). They all live in one data center (the cluster spans 3 DCs). As such, I wish to remove these 6 nodes, rebuild them, and then re-add them.

As per the Microsoft docs, adding and removing nodes is performed via config upgrades. Note: I recently used this process to add 12 nodes, so I understand the concept of SF config upgrades well.

Unfortunately, I'm unable to do ANY config upgrade on this cluster until the nodes are removed, and the removal itself fails with ValidationExceptions reported by the Start-ServiceFabricClusterConfigurationUpgrade PowerShell command:

  • If I don't add the 6 nodes to the "NodesToBeRemoved" section, I get a validation error saying that not all removed nodes are in this field
  • If I do add the 6 nodes, I get the following validation error:
Start-ServiceFabricClusterConfigurationUpgrade : 
System.Runtime.InteropServices.COMException (-2147017627)
ValidationException: Model validation error. Removing a non-seed node and changing reliability level in the same
upgrade is not supported. Initiate an upgrade to remove node first and then change the reliability level.
At line:1 char:1
+ Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "AL ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (Microsoft.Servi...usterConnection:ClusterConnection) [Start-ServiceFa
   ...gurationUpgrade], FabricException
    + FullyQualifiedErrorId : StartClusterConfigurationUpgradeErrorId,Microsoft.ServiceFabric.Powershell.StartClusterC
   onfigurationUpgrade

So, we're stuck! I've also already removed the node states, which left all 6 nodes in the "Invalid" state. Get-ServiceFabricClusterConfiguration does not return these 6 nodes, but they are still shown in SF Explorer and listed in the cluster manifest XML file.

As far as the reliability level is concerned, I'm pretty sure one can no longer change this in SF; i.e. older versions of SF allowed you to configure bronze/silver/gold in the config file, but in recent versions (6.0+??) this is a calculated field managed internally by SF. In any case, because the seed nodes will decrease from 9 to 6, I suspect the internally calculated reliability level will drop (presumably from Gold to Silver).

I've also come across a hack that someone has used to remove nodes from a cluster... but in my scenario, the nodes are still listed in the manifest file... Nonetheless, the words hack and production should never meet!

So, how do I get our production cluster out of this situation? Rebuilding the cluster is not an option (that's the whole reason for clusters... high availability!).

I discovered that the above errors are primarily a symptom of a lack of clearly documented procedures, as well as bad/misleading error messages, when doing Service Fabric configuration upgrades.

I performed quite a bit of my own testing to make sure I can confidently add/remove several nodes from a cluster. I also removed enough nodes to drop the seed nodes from 9 to 6.

So, to resolve the above issue, here's what I had to do to remove nodes:

  1. Use SF Explorer to remove the node state - this changed the node state from Error to Invalid
  2. Get the latest JSON config via Get-ServiceFabricClusterConfiguration
  3. Remove the node from the Nodes section
  4. Completely remove the NodesToBeRemoved JSON section (i.e. you'll get the "inconsistent" error if you leave an empty list of nodes to be removed, so just remove the containing JSON block)
  5. Do a config upgrade
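
Steps 3 and 4 above are plain JSON edits, so they can be scripted. The following is a minimal sketch, not a definitive tool. It assumes the standalone cluster config layout as I understand it: a top-level "nodes" list whose entries have a "nodeName", and a NodesToBeRemoved parameter inside properties.fabricSettings. The function name, the node names and the version value are hypothetical, and setting a new clusterConfigurationVersion is my own assumption about what the upgrade expects; it is not stated in the steps above.

```python
import copy
import json


def prepare_removal_config(config, removed_nodes, new_version):
    """Return a copy of the cluster config with removed_nodes dropped.

    Assumed layout: top-level "nodes" entries carry "nodeName"; fabric
    settings live under properties.fabricSettings as sections that each
    have "name" and "parameters".
    """
    removed = set(removed_nodes)
    updated = copy.deepcopy(config)  # leave the caller's config untouched

    # Step 3: drop the removed nodes from the Nodes section.
    updated["nodes"] = [
        node for node in updated.get("nodes", [])
        if node.get("nodeName") not in removed
    ]

    # Step 4: delete the NodesToBeRemoved parameter entirely instead of
    # leaving it behind with an empty value.
    for section in updated.get("properties", {}).get("fabricSettings", []):
        section["parameters"] = [
            param for param in section.get("parameters", [])
            if param.get("name") != "NodesToBeRemoved"
        ]

    # Assumption: the upgrade needs a version that differs from the current one.
    updated["clusterConfigurationVersion"] = new_version
    return updated


if __name__ == "__main__":
    # Hypothetical, heavily trimmed config used only for illustration.
    sample = {
        "clusterConfigurationVersion": "1.0.0",
        "nodes": [{"nodeName": "vm0"}, {"nodeName": "vm5"}],
        "properties": {"fabricSettings": [{
            "name": "Setup",
            "parameters": [{"name": "NodesToBeRemoved", "value": "vm5"}],
        }]},
    }
    print(json.dumps(prepare_removal_config(sample, ["vm5"], "2.0.0"), indent=2))
```

The result would be written to a file and passed to Start-ServiceFabricClusterConfigurationUpgrade via -ClusterConfigPath (step 5). Working on a deep copy keeps the config returned by Get-ServiceFabricClusterConfiguration available for comparison if validation fails.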

Note: Initially I tried doing just steps 2-5 above, but that didn't work and the node remained in the Error state.

That said, from my experience, please also note the following when removing nodes (this info is not clear in the Microsoft docs):

  • You can remove multiple seed nodes at once (I wanted to do this to try and replicate the above scenario)
  • You can add multiple nodes at once too - just be aware that the SF config upgrade status tooling may not show any activity or indication that anything is happening... be prepared to wait at least 15 minutes (it depends on how many nodes you're adding... after all, SF is copying installation files to the nodes)
  • Sometimes, when removing one or more nodes, a node won't be successfully removed but will be left in an Error status. If this is the case, use SF Explorer (or PowerShell) to remove the node state. The status will change to Invalid. At this point, do another config upgrade, ensuring that:
    • The removed node(s) are not in the Nodes section
    • The removed node(s) are not in the NodesToBeRemoved list
    • As per above, if the value of NodesToBeRemoved is (or should be) empty, remove this whole JSON block; otherwise you'll get a misleading/vague warning that the NodesToBeRemoved parameter contains inconsistent information.
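
The three conditions in that last bullet can be checked before submitting the follow-up config upgrade. Below is a rough sketch of such a pre-flight check, under the same assumptions about the JSON layout ("nodes" entries with "nodeName", a comma-separated NodesToBeRemoved value inside properties.fabricSettings). The function name and node names are hypothetical, and this only mirrors the rules described here; it is not Service Fabric's own validation.

```python
def find_config_problems(config, removed_nodes):
    """Return human-readable reasons the follow-up upgrade may be rejected."""
    removed = set(removed_nodes)
    problems = []

    # Removed nodes must no longer appear in the Nodes section.
    listed = {node.get("nodeName") for node in config.get("nodes", [])}
    for name in sorted(removed & listed):
        problems.append(f"{name} is still in the nodes section")

    for section in config.get("properties", {}).get("fabricSettings", []):
        for param in section.get("parameters", []):
            if param.get("name") != "NodesToBeRemoved":
                continue
            # Assumed format: comma-separated node names.
            names = {
                part.strip()
                for part in param.get("value", "").split(",")
                if part.strip()
            }
            if not names:
                problems.append(
                    "NodesToBeRemoved is empty; delete the whole parameter block"
                )
            for name in sorted(removed & names):
                problems.append(f"{name} is still in NodesToBeRemoved")

    return problems


if __name__ == "__main__":
    # Hypothetical config that breaks two of the three rules.
    stale = {
        "nodes": [{"nodeName": "vm0"}, {"nodeName": "vm5"}],
        "properties": {"fabricSettings": [{
            "name": "Setup",
            "parameters": [{"name": "NodesToBeRemoved", "value": "vm5"}],
        }]},
    }
    for problem in find_config_problems(stale, ["vm5"]):
        print(problem)
```

An empty result means the config satisfies the three conditions; it does not guarantee that the upgrade will pass every other validation rule.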

The latter part really is the confusing part that tripped me up last time. The other thing to remember is that, once you successfully remove nodes, Get-ServiceFabricClusterConfiguration will STILL return the removed nodes in the NodesToBeRemoved parameter. This will likely trip you up on any subsequent attempt to do a config upgrade. As such, I recommend you do one more final config upgrade with this section completely removed.

As a final note: if you re-add a node that has previously been removed, it may come back in a Deactivated status. Simply activate this node and all should be fine.
