
Galera Cluster 2 nodes - Unable to restart MySQL server on node 1

I am experiencing an issue similar to this one: Unable to restart MySQL server, but I am not sure how to proceed, so I am asking the community, especially anyone with more experience than me with Galera Cluster. I'll try to summarize:

Configuration:

Galera Cluster with 2 nodes - every node runs Ubuntu 16.04 and MariaDB 10.2.17.

Issue:

One of the nodes (node1) is faulty. Unfortunately there is no error log or general log configured, but in journalctl I can see that the error is something like "mariadb innodb assertion failure in file", and it suggests trying innodb_force_recovery (1 to 6). However, I don't know how Galera synchronization works, or whether this is an active/active configuration, so I am not confident starting a node that has been out of sync for days and risking a split-brain situation. Also, I see that a file called "sst_in_progress" is present in the datadir.

Consideration:

Would it be OK to delete the datadir on the faulty node and restart the mysql service? Would that be enough to make it start syncing with node2, replicating the data without touching the data on node2, which is currently serving clients? Also, as far as I understand, Galera Cluster doesn't replicate system tables, so I would need to export the mysql.user table from node2 and import it on node1 to get all the users and permissions back. Thanks, I hope I've explained the issue clearly; if not, please tell me.

The file sst_in_progress means that the broken node has requested an SST (State Snapshot Transfer), which is essentially a full data transfer from the other node in the cluster. There are several different SST methods that you can use, and you can see which one you have enabled by checking the wsrep_sst_method variable. It is important to note that the donor and joiner nodes must use the same SST method. For more information about the different SST methods, and SSTs in general, I recommend the MariaDB documentation.
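For instance, you can check the configured method with a query like the one below; run it on both nodes, since donor and joiner must match. The values shown in the comment are common ones, not an exhaustive list:

```sql
-- Check which SST method this node is configured to use.
-- Run on BOTH nodes: donor and joiner must use the same method.
SHOW VARIABLES LIKE 'wsrep_sst_method';
-- Typical values: rsync, mariabackup, mysqldump
```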

The SST should be able to rejoin the broken node to the cluster. You can see the progress of the SST in the MySQL error logs. But, as you do not have those configured, you could instead check the wsrep status (e.g. show global status like 'wsrep%';) on the nodes. You can see the node status by checking wsrep_local_state_comment. If the healthy node is transferring an SST to the broken node, you will see that the value of wsrep_local_state_comment is Donor/Desynced. More detailed information about the various wsrep variables can be found in the Galera documentation.
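A minimal monitoring session could look like this; the states noted in the comments are the typical ones you would expect to see, not guaranteed output:

```sql
-- On the donor (healthy) node, during the transfer:
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
-- expected during SST: Donor/Desynced

-- On the joiner (broken) node:
--   Joiner/Joining -> transfer in progress
--   Synced         -> the node has fully rejoined the cluster
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
-- should report 2 once both nodes are back in the cluster
```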

If the automatic SST has problems, you can instead do a manual SST. For MariaDB 10.1 or later, it is recommended to use Mariabackup for this. You can find information on performing a manual SST with Mariabackup in the MariaDB documentation.

In answer to your question about deleting the datadir on the broken node and restarting the MySQL service: this would force the broken node to request an SST from the other node. Please note that starting the MySQL service may time out, as this process can take a long time depending on the size of the datadir.
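A rough sketch of that procedure might look like the following. The paths assume the default Ubuntu/MariaDB layout (/var/lib/mysql), so adjust them to your setup; moving the old datadir aside rather than deleting it keeps a fallback copy:

```sh
# On the broken node (node1) only - never on the healthy donor!
systemctl stop mariadb

# Move the old datadir aside instead of deleting it, keeping a fallback copy.
mv /var/lib/mysql /var/lib/mysql.broken
mkdir /var/lib/mysql
chown mysql:mysql /var/lib/mysql

# Starting the service with an empty datadir forces a full SST from node2.
# On a large dataset this may exceed the systemd start timeout.
systemctl start mariadb
journalctl -u mariadb -f   # watch the SST progress
```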

The SST will also transfer the system tables to the broken node, so after the SST is complete, the mysql.user table should contain all the users and permissions, and you should not need to recreate them.

As a side observation, I see that you are using a 2-node Galera cluster. In general, it is recommended to use at least 3 nodes. If you are operating a 2-node Galera cluster, you may want to use a Galera arbitrator (garbd), which provides a third vote for quorum without storing data. More information about that can be found in the Galera documentation.
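On Ubuntu, a garbd instance running on a third machine is typically configured via /etc/default/garb. The addresses and cluster name below are placeholders; the cluster name must match the wsrep_cluster_name used by the two data nodes:

```
# /etc/default/garb on the arbitrator machine (placeholder values)
GALERA_NODES="192.0.2.10:4567 192.0.2.11:4567"
GALERA_GROUP="my_galera_cluster"
```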

