
Percona mysql xtradb cluster doesn't start properly and node restarts don't work

tl;dr

When starting a fresh Percona cluster of 3 Kubernetes pods, the grastate.dat seq_no is set to -1 and doesn't change. On deleting one pod and watching it restart, expecting it to rejoin the cluster, it sets its initial position to 00000000-0000-0000-0000-000000000000:-1 and tries to connect to itself (its former IP), maybe because it had been the first pod in the cluster? It then times out on its erroneous connection to itself:

2017-03-26T08:38:05.374058Z 0 [Note] WSREP: (b7571ff8, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S

The cluster doesn't get started properly and I'm unable to successfully restart pods in the cluster.

Full

When I start the cluster from scratch, with blank data directories and a fresh etcd cluster, everything seems to come up. However, when I look at grastate.dat I find that the seq_no for each pod is -1:

root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0

At this point I can do mysql -h percona -u wordpress -p and connect, and WordPress works too.
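
For what it's worth, the wsrep status variables seem like a better health check than grastate.dat at this stage, since the seqno there is typically only written out on a clean shutdown and stays at -1 while mysqld is running. Something along these lines (same service name and credentials as above):

mysql -h percona -u wordpress -p \
  -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'; SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"

A healthy 3-node cluster should report wsrep_cluster_size as 3 and wsrep_local_state_comment as Synced.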

Scenario: I have 3 percona pods

jonathan@ubuntu:~/Projects/k8wp$ kubectl get pods
NAME                         READY     STATUS    RESTARTS   AGE
etcd-0                       1/1       Running   1          12h
etcd-1                       1/1       Running   0          12h
etcd-2                       1/1       Running   3          12h
etcd-3                       1/1       Running   1          12h
percona-0                    1/1       Running   0          8m
percona-1                    1/1       Running   0          57m
percona-2                    1/1       Running   0          57m

When I try to restart percona-0, it gets kicked out of the cluster on restarting; percona-0's gvwstate.dat file shows:

root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/gvwstate.dat
my_uuid: b7571ff8-11f8-11e7-bd2d-8b50487e1523
#vwbeg
view_id: 3 b7571ff8-11f8-11e7-bd2d-8b50487e1523 3
bootstrap: 0
member: b7571ff8-11f8-11e7-bd2d-8b50487e1523 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend

The other 2 pods in the cluster show:

root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/gvwstate.dat
my_uuid: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a
#vwbeg
view_id: 3 bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 4
bootstrap: 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/gvwstate.dat
my_uuid: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a
#vwbeg
view_id: 3 bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 4
bootstrap: 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend

Here are what I think are the relevant errors from percona-0's startup:

2017-03-26T08:37:58.370605Z 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2017-03-26T08:37:58.372537Z 0 [Note] WSREP: gcomm: connecting to group 'wordpress-001', peer '10.52.0.26:'
2017-03-26T08:38:01.373345Z 0 [Note] WSREP: (b7571ff8, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
2017-03-26T08:38:01.373682Z 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-03-26T08:38:01.373750Z 0 [Note] WSREP: view(view_id(NON_PRIM,b7571ff8,5) memb {
    b7571ff8,0
} joined {
} left {
} partitioned {
})
2017-03-26T08:38:01.373838Z 0 [Note] WSREP: gcomm: connected
2017-03-26T08:38:01.373872Z 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2017-03-26T08:38:01.373987Z 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2017-03-26T08:38:01.374012Z 0 [Note] WSREP: Opened channel 'wordpress-001'
2017-03-26T08:38:01.374108Z 0 [Note] WSREP: Waiting for SST to complete.
2017-03-26T08:38:01.374417Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2017-03-26T08:38:01.374469Z 0 [Note] WSREP: Flow-control interval: [16, 16]
2017-03-26T08:38:01.374491Z 0 [Note] WSREP: Received NON-PRIMARY.
2017-03-26T08:38:01.374560Z 1 [Note] WSREP: New cluster view: global state: :-1, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version -1

The IP it is trying to connect to in 2017-03-26T08:37:58.372537Z 0 [Note] WSREP: gcomm: connecting to group 'wordpress-001', peer '10.52.0.26:' is actually that pod's previous IP. Here's the listing of keys in etcd I took before deleting percona-0:

/ # etcdctl ls --recursive
/pxc-cluster
/pxc-cluster/wordpress
/pxc-cluster/queue
/pxc-cluster/queue/wordpress
/pxc-cluster/queue/wordpress-001
/pxc-cluster/wordpress-001
/pxc-cluster/wordpress-001/10.52.1.46
/pxc-cluster/wordpress-001/10.52.1.46/ipaddr
/pxc-cluster/wordpress-001/10.52.1.46/hostname
/pxc-cluster/wordpress-001/10.52.2.33
/pxc-cluster/wordpress-001/10.52.2.33/ipaddr
/pxc-cluster/wordpress-001/10.52.2.33/hostname
/pxc-cluster/wordpress-001/10.52.0.26
/pxc-cluster/wordpress-001/10.52.0.26/hostname
/pxc-cluster/wordpress-001/10.52.0.26/ipaddr

After kubectl delete pods/percona-0:

/ # etcdctl ls --recursive
/pxc-cluster
/pxc-cluster/queue
/pxc-cluster/queue/wordpress
/pxc-cluster/queue/wordpress-001
/pxc-cluster/wordpress-001
/pxc-cluster/wordpress-001/10.52.1.46
/pxc-cluster/wordpress-001/10.52.1.46/ipaddr
/pxc-cluster/wordpress-001/10.52.1.46/hostname
/pxc-cluster/wordpress-001/10.52.2.33
/pxc-cluster/wordpress-001/10.52.2.33/ipaddr
/pxc-cluster/wordpress-001/10.52.2.33/hostname
/pxc-cluster/wordpress

Also, during the restart percona-0 tried to register itself with etcd:

{"action":"create","node":{"key":"/pxc-cluster/queue/wordpress-001/00000000000000009886","value":"10.52.0.27","expiration":"2017-03-26T08:38:57.980325718Z","ttl":60,"modifiedIndex":9886,"createdIndex":9886}}
{"action":"set","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27/ipaddr","value":"10.52.0.27","expiration":"2017-03-26T08:38:28.01814818Z","ttl":30,"modifiedIndex":9887,"createdIndex":9887}}
{"action":"set","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27/hostname","value":"percona-0","expiration":"2017-03-26T08:38:28.037188157Z","ttl":30,"modifiedIndex":9888,"createdIndex":9888}}
{"action":"update","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27","dir":true,"expiration":"2017-03-26T08:38:28.054726795Z","ttl":30,"modifiedIndex":9889,"createdIndex":9887},"prevNode":{"key":"/pxc-cluster/wordpress-001/10.52.0.27","dir":true,"modifiedIndex":9887,"createdIndex":9887}}

which doesn't work.

From the second member of the cluster, percona-1:

2017-03-26T08:37:44.069583Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.52.0.26:4567 
2017-03-26T08:37:45.069756Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') reconnecting to b7571ff8 (tcp://10.52.0.26:4567), attempt 0
2017-03-26T08:37:48.570332Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
2017-03-26T08:37:49.605089Z 0 [Note] WSREP: evs::proto(bd05a643, GATHER, view_id(REG,b7571ff8,3)) suspecting node: b7571ff8
2017-03-26T08:37:49.605276Z 0 [Note] WSREP: evs::proto(bd05a643, GATHER, view_id(REG,b7571ff8,3)) suspected node without join message, declaring inactive
2017-03-26T08:37:50.104676Z 0 [Note] WSREP: declaring c33d6a73 at tcp://10.52.2.33:4567 stable

New info: I restarted percona-0 again, and this time it somehow came up! After a few tries I realised the pod needs to be restarted twice to come up, i.e. after deleting it the first time it comes up with the above errors, and after deleting it a second time it comes up okay and syncs with the other members. Could this be because it was the first pod in the cluster?

I've tested deleting the other pods but they all come back up okay.

The issue only lies with percona-0.

Also: taking down all the pods at once, as would happen if my node were to crash, is the situation where the pods don't come back up at all! I suspect it's because no state is saved to grastate.dat, i.e. seq_no remains -1 even though the global id may change. The pods exit with mysqld shutting down and the following errors:

jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-2 | grep ERROR
2017-03-26T11:20:25.795085Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:25.795276Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:25.795544Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.2.36': -110 (Connection timed out)
2017-03-26T11:20:25.795618Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:25.795645Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.2.36) failed: 7
2017-03-26T11:20:25.795693Z 0 [ERROR] Aborting
jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-1 | grep ERROR
2017-03-26T11:20:27.093780Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:27.093977Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:27.094145Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.1.49': -110 (Connection timed out)
2017-03-26T11:20:27.094200Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:27.094227Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.1.49) failed: 7
2017-03-26T11:20:27.094247Z 0 [ERROR] Aborting
jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-0 | grep ERROR
2017-03-26T11:20:52.040214Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:52.040279Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:52.040385Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.2.36': -110 (Connection timed out)
2017-03-26T11:20:52.040437Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:52.040471Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.2.36) failed: 7
2017-03-26T11:20:52.040508Z 0 [ERROR] Aborting

grastate.dat after deleting all pods:

root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0

No gvwstate.dat.

Fixed it by changing the entrypoint in the container to the following script:

#!/bin/bash
sed -i "s|safe_to_bootstrap.*:.*|safe_to_bootstrap: 1|" /var/lib/mysql/grastate.dat
/entrypoint.sh --wsrep-new-cluster
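
For completeness, a minimal sketch of how such a wrapper could be wired in on the Kubernetes side, assuming the StatefulSet is named percona and the script was baked into the image as /bootstrap-entrypoint.sh (both names are assumptions, not part of the original setup):

# Hypothetical: point the container at the wrapper instead of the stock entrypoint.
kubectl patch statefulset percona --type='json' -p '[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/command",
   "value": ["/bootstrap-entrypoint.sh"]}
]'

The same effect can be achieved by setting command: on the container in the StatefulSet manifest directly.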

Thanks to https://www.claudiokuenzler.com/blog/494/galera-cluster-mysql-not-starting-failed-to-open-channel-reach-primary#.WNesDiF97Qo

The issue is, when restarting the 3 pods from a crash, they all hit the following error:

[ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)

What that means (summarizing from the link) is that, since all the pods are down, the first pod (the pods are managed by a StatefulSet) comes up and tries to reconnect to the cluster, but doesn't find any other pods to connect to, so it goes down; the next pod then comes up, tries the same thing, hits the same error, goes down, and so on.

The solution is for the first pod to start a new cluster when it comes up; then all the subsequent pods will come up and find a node to connect to. It'll still come up with all the data.
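
If forcing --wsrep-new-cluster on every pod feels too blunt, one possible refinement, purely a sketch that assumes the percona-N hostname pattern and the stock /entrypoint.sh from this image, is to bootstrap only on the first StatefulSet ordinal and let the others join normally:

#!/bin/bash
# Sketch only: bootstrap a new cluster on the first ordinal (percona-0) and let
# the remaining pods join it through the normal discovery path.
if [[ "$(hostname)" == *-0 ]]; then
    # Mark the local state as safe to bootstrap (if the file exists yet),
    # then force a new cluster on this node only.
    if [ -f /var/lib/mysql/grastate.dat ]; then
        sed -i "s|safe_to_bootstrap.*|safe_to_bootstrap: 1|" /var/lib/mysql/grastate.dat
    fi
    exec /entrypoint.sh --wsrep-new-cluster
else
    exec /entrypoint.sh
fi

Note that re-bootstrapping the first ordinal on every restart can still split an otherwise healthy cluster, so treat this as an illustration of the idea rather than a production-ready entrypoint.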

So with Percona XtraDB, the Docker container's entrypoint looks like:

exec mysqld --user=mysql --wsrep_cluster_name=$CLUSTER_NAME --wsrep_cluster_address="gcomm://$cluster_join" --wsrep_sst_method=xtrabackup-v2 --wsrep_sst_auth="xtrabackup:$XTRABACKUP_PASSWORD" --log-error=${DATADIR}error.log $CMDARG

So all I have to do to get the setup running is pass the --wsrep-new-cluster argument mentioned earlier to /entrypoint.sh, like so:

/entrypoint.sh --wsrep-new-cluster
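
To confirm that the bootstrap actually produced a Primary component before the other pods try to join, a check along these lines can be run against the first pod (pod name and user are the ones from this setup):

# Expect wsrep_cluster_status = Primary, and wsrep_cluster_size to grow from 1 to 3
# as the remaining pods join.
kubectl exec -it percona-0 -- mysql -u wordpress -p \
  -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'; SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"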

PS: I tried the above on its own at first, but I ran into an error stating that to force a new cluster and bootstrap with that node, I had to set safe_to_bootstrap from 0 to 1 in /var/lib/mysql/grastate.dat.
