How can a failed Kubernetes Ceph node be deleted automatically?

In an environment with more than one node that uses Ceph block volumes in RWO mode, if a node fails (it is unreachable and will not come back soon) and the pod is rescheduled to another node, the pod can't start if it has a Ceph block PVC. The reason is that the volume is 'still being used' by the other pod, the one on the failed node: because the node failed, its resources can't be removed properly.

If I remove the node from the cluster using kubectl delete node dead-node, the pod can start because the resources get removed.

How can I do this automatically? Some possibilities I have thought about are:

  • Can I set a force detach timeout for the volume?
  • Set a delete node timeout?
  • Automatically delete a node with given taints?

I can use the ReadWriteMany mode with other volume types to let the PV be used by more than one pod, but it is not ideal.
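For illustration, the access mode is requested on the PersistentVolumeClaim. A minimal sketch, where the claim name, storage class, and size are hypothetical placeholders, and the storage class is assumed to be backed by a volume type that supports ReadWriteMany:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data            # hypothetical claim name
spec:
  accessModes:
  - ReadWriteMany              # RWX instead of ReadWriteOnce (RWO)
  storageClassName: shared-fs  # hypothetical class; must support RWX
  resources:
    requests:
      storage: 1Gi             # placeholder size
```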

You can probably have a sidecar container and tweak the Readiness and Liveness probes in your pod so that the pod doesn't restart if the Ceph block volume is unreachable for some time by the container that is using it. (There may be other implications for your application, though.)

Something like this:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: ceph
  name: ceph-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
  - name: cephclient
    image: ceph
    volumeMounts:
    - name: ceph
      mountPath: /cephmountpoint
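    livenessProbe:
      # ... something
      initialDelaySeconds: 5
      periodSeconds: 3600  # make this real long
  volumes:
  - name: ceph
    persistentVolumeClaim:
      claimName: ceph-block-pvc  # hypothetical name; use your own Ceph block PVC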
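
One possible way to fill in the cephclient probe, as a sketch only: the ls check and the timeout and threshold values are assumptions for illustration, not something the example above specifies. timeoutSeconds and failureThreshold are standard probe fields; with a long periodSeconds, the kubelet only restarts the container after failureThreshold consecutive failures, so the restart is delayed by roughly that many probe periods.

```yaml
# Fragment for the cephclient container; all values are illustrative assumptions.
livenessProbe:
  exec:
    command:
    - ls
    - /cephmountpoint      # mount path taken from the pod spec above
  initialDelaySeconds: 5
  periodSeconds: 3600      # long interval between probes
  timeoutSeconds: 10       # assumed value; the Kubernetes default is 1
  failureThreshold: 3      # assumed value; same as the Kubernetes default
```

An exec probe against a hung block device may itself block until the timeout, so the timeout is worth tuning for your workload.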

✌️☮️
