
How to edit etcd configuration in an existing bare metal Kubernetes cluster

I have a standalone Kubernetes cluster installed on a physical RHEL machine.

I'm experiencing recurring crashes of the etcd and kube-apiserver containers. From their logs, my guess is that I need to tune etcd to perform better in this environment.

The following guide describes how to tune etcd: https://etcd.io/docs/v3.4.0/tuning/

However, I'm not sure how this can be done in an existing cluster. Is etcd a Kubernetes-native component? Should I patch its deployment? It's quite hard with etcd itself being down.

Error log snippets:

etcd

[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-03-09 08:09:12.996848 I | etcdmain: etcd Version: 3.4.13
2021-03-09 08:09:12.996909 I | etcdmain: Git SHA: ae9734ed2
2021-03-09 08:09:12.996915 I | etcdmain: Go Version: go1.12.17
2021-03-09 08:09:12.996926 I | etcdmain: Go OS/Arch: linux/amd64
2021-03-09 08:09:12.996932 I | etcdmain: setting maximum number of CPUs to 24, total number of available CPUs is 24
2021-03-09 08:09:12.996939 N | etcdmain: failed to detect default host (could not find default route)
2021-03-09 08:09:12.997017 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-03-09 08:09:12.997071 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = 
2021-03-09 08:09:13.004253 I | embed: name = hostname-acme.com
2021-03-09 08:09:13.004268 I | embed: data dir = /var/lib/etcd
2021-03-09 08:09:13.004276 I | embed: member dir = /var/lib/etcd/member
2021-03-09 08:09:13.004284 I | embed: heartbeat = 100ms
2021-03-09 08:09:13.004292 I | embed: election = 1000ms
2021-03-09 08:09:13.004297 I | embed: snapshot count = 10000
2021-03-09 08:09:13.004339 I | embed: advertise client URLs = https://10.43.16.56:2379
2021-03-09 08:09:13.004351 I | embed: initial advertise peer URLs = https://10.43.16.56:2380
2021-03-09 08:09:13.004358 I | embed: initial cluster = 
2021-03-09 08:09:13.004413 W | pkg/fileutil: check file permission: directory "/var/lib/etcd" exist, but the permission is "drwxrwxrwx". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
2021-03-09 08:09:14.066995 I | etcdserver: recovered store from snapshot at index 1140183
2021-03-09 08:09:14.259395 I | mvcc: restore compact to 935600
2021-03-09 08:09:14.538404 I | etcdserver: restarting member fe007e8a424c7486 in cluster 59c13b0ba56e3f74 at commit index 1141310
raft2021/03/09 08:09:14 INFO: fe007e8a424c7486 switched to configuration voters=(18302768017916589190)
raft2021/03/09 08:09:14 INFO: fe007e8a424c7486 became follower at term 1050
raft2021/03/09 08:09:14 INFO: newRaft fe007e8a424c7486 [peers: [fe007e8a424c7486], term: 1050, commit: 1141310, applied: 1140183, lastindex: 1141310, lastterm: 1050]
2021-03-09 08:09:14.539625 I | etcdserver/api: enabled capabilities for version 3.4
2021-03-09 08:09:14.539659 I | etcdserver/membership: added member fe007e8a424c7486 [https://10.43.16.56:2380] to cluster 59c13b0ba56e3f74 from store
2021-03-09 08:09:14.539671 I | etcdserver/membership: set the cluster version to 3.4 from store
2021-03-09 08:09:14.650895 W | auth: simple token is not cryptographically signed
2021-03-09 08:09:14.784236 I | mvcc: restore compact to 935600
2021-03-09 08:09:14.918183 I | etcdserver: starting server... [version: 3.4.13, cluster version: 3.4]
2021-03-09 08:09:14.918532 I | etcdserver: fe007e8a424c7486 as single-node; fast-forwarding 9 ticks (election ticks 10)
2021-03-09 08:09:14.922068 I | embed: ClientTLS: cert = /etc/kubernetes/pki/etcd/server.crt, key = /etc/kubernetes/pki/etcd/server.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = 
2021-03-09 08:09:14.922089 I | embed: listening for peers on 10.43.16.56:2380
2021-03-09 08:09:14.922328 I | embed: listening for metrics on http://127.0.0.1:2381
2021-03-09 08:09:14.925808 W | etcdserver: failed to apply request "header:<ID:8396530591132430964 > lease_revoke:<id:74867815fe26322c>" with response "size:30" took (23.834µs) to execute, err is lease not found
2021-03-09 08:09:14.925900 W | etcdserver: failed to apply request "header:<ID:8396530591132430966 > lease_revoke:<id:74867815fe263251>" with response "size:30" took (18.438µs) to execute, err is lease not found
2021-03-09 08:09:14.925961 W | etcdserver: failed to apply request "header:<ID:8396530591132430968 > lease_revoke:<id:74867815fe26322c>" with response "size:30" took (26.391µs) to execute, err is lease not found
2021-03-09 08:09:14.926062 W | etcdserver: failed to apply request "header:<ID:8396530591132430971 > lease_revoke:<id:74867815fe263251>" with response "size:30" took (18.063µs) to execute, err is lease not found
2021-03-09 08:09:14.926114 W | etcdserver: failed to apply request "header:<ID:8396530591132430973 > lease_revoke:<id:74867815fe26322c>" with response "size:30" took (26.452µs) to execute, err is lease not found
2021-03-09 08:09:14.926194 W | etcdserver: failed to apply request "header:<ID:8396530591132430975 > lease_revoke:<id:74867815fe263251>" with response "size:30" took (17.427µs) to execute, err is lease not found
2021-03-09 08:09:14.926282 W | etcdserver: failed to apply request "header:<ID:8396530591132430979 > lease_revoke:<id:74867815fe26322c>" with response "size:30" took (15.211µs) to execute, err is lease not found
2021-03-09 08:09:14.926333 W | etcdserver: failed to apply request "header:<ID:8396530591132430980 > lease_revoke:<id:74867815fe263251>" with response "size:30" took (24.87µs) to execute, err is lease not found
raft2021/03/09 08:09:15 INFO: fe007e8a424c7486 is starting a new election at term 1050
raft2021/03/09 08:09:15 INFO: fe007e8a424c7486 became candidate at term 1051
raft2021/03/09 08:09:15 INFO: fe007e8a424c7486 received MsgVoteResp from fe007e8a424c7486 at term 1051
raft2021/03/09 08:09:15 INFO: fe007e8a424c7486 became leader at term 1051
raft2021/03/09 08:09:15 INFO: raft.node: fe007e8a424c7486 elected leader fe007e8a424c7486 at term 1051
2021-03-09 08:09:15.804920 I | etcdserver: published {Name:hostname-acme.com ClientURLs:[https://10.43.16.56:2379]} to cluster 59c13b0ba56e3f74
2021-03-09 08:09:15.804960 I | embed: ready to serve client requests
2021-03-09 08:09:15.805085 I | embed: ready to serve client requests
2021-03-09 08:09:15.806545 I | embed: serving client requests on 10.43.16.56:2379
2021-03-09 08:09:15.806594 I | embed: serving client requests on 127.0.0.1:2379
...
2021-03-09 08:15:50.701225 W | etcdserver: read-only range request "key:\"/registry/events/default/xxx-pvc.166a9e8623334d39\" " with result "error:context canceled" took too long (5.378372536s) to execute
WARNING: 2021/03/09 08:15:50 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2021-03-09 08:15:52.165933 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context deadline exceeded" took too long (2.000041013s) to execute
WARNING: 2021/03/09 08:15:52 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2021-03-09 08:15:52.581845 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context deadline exceeded" took too long (2.000024386s) to execute
WARNING: 2021/03/09 08:15:52 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2021-03-09 08:15:52.879773 W | etcdserver: failed to revoke 74867816086e2e90 ("etcdserver: request timed out")

kube-apiserver

I0309 08:27:06.013263       1 server.go:163] Version: v1.19.7
...
...
E0309 08:31:14.921841       1 storage_rbac.go:317] unable to reconcile rolebinding.rbac.authorization.k8s.io/system:controller:token-cleaner in kube-system: Get "https://[::1]:6443/apis/rbac.authorization.k8s.io/v1/namespaces/kube-system/rolebindings/system:controller:token-cleaner": dial tcp [::1]:6443: connect: connection refused
E0309 08:31:14.922948       1 storage_rbac.go:317] unable to reconcile rolebinding.rbac.authorization.k8s.io/system:controller:bootstrap-signer in kube-public: Get "https://[::1]:6443/apis/rbac.authorization.k8s.io/v1/namespaces/kube-public/rolebindings/system:controller:bootstrap-signer": dial tcp [::1]:6443: connect: connection refused
...
W0309 08:31:33.190649       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0309 08:31:33.255700       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0309 08:31:33.565370       1 controller.go:193] RemoveEndpoints() timed out

Indeed, you get the YAML of the Deployment and make the changes there.

If you installed the cluster with kubeadm, the file will be under /etc/kubernetes/manifests/. The kubelet watches that directory, so once you save your changes, etcd is automatically re-deployed.
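
As a rough illustration of such an edit, here is a minimal Python sketch that sets etcd flags in the text of a manifest. It assumes the usual kubeadm layout, where the manifest is /etc/kubernetes/manifests/etcd.yaml and the flags are listed one per line under the container's `command:` list right after `- etcd`. The helper name `set_etcd_flags`, the trimmed `SAMPLE` manifest, and the values 250 and 2500 are made up for the example and are not tuning recommendations. `--heartbeat-interval` and `--election-timeout` are the flags discussed in the etcd tuning guide, both in milliseconds.

```python
def set_etcd_flags(manifest_text, flags):
    """Return manifest text with every --name=value in `flags` set.

    Purely line-based (no YAML parser): flags already present are
    overwritten in place, missing ones are added right after "- etcd".
    """
    lines = manifest_text.splitlines()
    pending = dict(flags)

    # Pass 1: overwrite flags that already appear in the command list.
    for i, line in enumerate(lines):
        body = line.lstrip()
        for name, value in list(pending.items()):
            if body.startswith(f"- --{name}="):
                indent = line[: len(line) - len(body)]
                lines[i] = f"{indent}- --{name}={value}"
                del pending[name]
                break

    # Pass 2: insert the flags that were not present after "- etcd".
    if pending:
        for i, line in enumerate(lines):
            if line.strip() == "- etcd":
                indent = line[: len(line) - len(line.lstrip())]
                lines[i + 1 : i + 1] = [
                    f"{indent}- --{name}={value}"
                    for name, value in pending.items()
                ]
                break
        else:
            raise ValueError('no "- etcd" command entry found in manifest')

    return "\n".join(lines) + "\n"


# Hypothetical, heavily trimmed stand-in for etcd.yaml.
SAMPLE = """\
spec:
  containers:
  - command:
    - etcd
    - --data-dir=/var/lib/etcd
    - --heartbeat-interval=100
    name: etcd
"""

if __name__ == "__main__":
    # Illustrative values only; pick real ones from the tuning guide.
    print(set_etcd_flags(SAMPLE, {"heartbeat-interval": "250",
                                  "election-timeout": "2500"}))
```

To apply something like this to a real node, you would read the manifest, run it through the helper, and write the result back as root. Keep the backup copy outside /etc/kubernetes/manifests/, since the kubelet may treat extra files in that directory as manifests too.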


 