
Kubernetes pod failed with "Back-off restarting failed container"


I am trying to set up Prometheus. When I deploy the YAMLs below, the pod fails with "Back-off restarting failed container".

Complete description:

Name:         prometheus-75dd748df4-wrwlr
Namespace:    monitoring
Priority:     0
Node:         kbs-vm-02/172.16.1.8
Start Time:   Tue, 28 Apr 2020 06:13:22 +0000
Labels:       app=prometheus
              pod-template-hash=75dd748df4
Annotations:  <none>
Status:       Running
IP:           10.44.0.7
IPs:
  IP:           10.44.0.7
Controlled By:  ReplicaSet/prometheus-75dd748df4
Containers:
  prom:
    Container ID:  docker://50fb273836c5522bbbe01d8db36e18688e0f673bc54066f364290f0f6854a74f
    Image:         quay.io/prometheus/prometheus:v2.4.3
    Image ID:      docker-pullable://quay.io/prometheus/prometheus@sha256:8e0e85af45fc2bcc18bd7221b8c92fe4bb180f6bd5e30aa2b226f988029c2085
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --config.file=/prometheus-cfg/prometheus.yml
      --storage.tsdb.path=/data
      --storage.tsdb.retention=$(STORAGE_LOCAL_RETENTION)
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 28 Apr 2020 06:14:08 +0000
      Finished:     Tue, 28 Apr 2020 06:14:08 +0000
    Ready:          False
    Restart Count:  3
    Limits:
      memory:  1Gi
    Requests:
      cpu:     200m
      memory:  500Mi
    Environment Variables from:
      prometheus-config-flags  ConfigMap  Optional: false
    Environment:               <none>
    Mounts:
      /data from storage (rw)
      /prometheus-cfg from config-file (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-token-bt7dw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config-file:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-config-file
    Optional:  false
  storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  prometheus-storage-claim
    ReadOnly:   false
  prometheus-token-bt7dw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-token-bt7dw
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                From                Message
  ----     ------            ----               ----                -------
  Warning  FailedScheduling  76s (x3 over 78s)  default-scheduler   running "VolumeBinding" filter plugin for pod "prometheus-75dd748df4-wrwlr": pod has unbound immediate PersistentVolumeClaims
  Normal   Scheduled         73s                default-scheduler   Successfully assigned monitoring/prometheus-75dd748df4-wrwlr to kbs-vm-02
  Normal   Pulled            28s (x4 over 72s)  kubelet, kbs-vm-02  Container image "quay.io/prometheus/prometheus:v2.4.3" already present on machine
  Normal   Created           28s (x4 over 72s)  kubelet, kbs-vm-02  Created container prom
  Normal   Started           27s (x4 over 71s)  kubelet, kbs-vm-02  Started container prom
  Warning  BackOff           13s (x6 over 69s)  kubelet, kbs-vm-02  Back-off restarting failed container

Deployment file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      securityContext:
        fsGroup: 1000
      serviceAccountName: prometheus
      containers:
      - image: quay.io/prometheus/prometheus:v2.4.3
        name: prom
        args:
        - '--config.file=/prometheus-cfg/prometheus.yml'
        - '--storage.tsdb.path=/data'
        - '--storage.tsdb.retention=$(STORAGE_LOCAL_RETENTION)'
        envFrom:
        - configMapRef:
            name: prometheus-config-flags
        ports:
        - containerPort: 9090
          name: prom-port
        resources:
          limits:
            memory: 1Gi
          requests:
            cpu: 200m
            memory: 500Mi
        volumeMounts:
        - name: config-file
          mountPath: /prometheus-cfg
        - name: storage
          mountPath: /data
      volumes:
      - name: config-file
        configMap:
          name: prometheus-config-file
      - name: storage
        persistentVolumeClaim:
          claimName: prometheus-storage-claim

PV YAML:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-storage
  namespace: monitoring
  labels:
    app: prometheus
spec:
  capacity:
    storage: 12Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data"

PVC YAML:

[vidya@KBS-VM-01 7-1_prometheus]$ cat prometheus/prom-pvc.yml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-storage-claim
  namespace: monitoring
  labels:
    app: prometheus
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Do you know what the issue is and how to fix it? Please also let me know if any more files need to be shared.

My guess is that there is some problem with the storage configs, based on the event logs:

Warning  FailedScheduling  76s (x3 over 78s)  default-scheduler  running "VolumeBinding" filter plugin for pod "prometheus-75dd748df4-wrwlr": pod has unbound immediate PersistentVolumeClaims

I am using local storage.

[vidya@KBS-VM-01 7-1_prometheus]$ kubectl describe pvc prometheus-storage-claim -n monitoring
Name:          prometheus-storage-claim
Namespace:     monitoring
StorageClass:
Status:        Bound
Volume:        prometheus-storage
Labels:        app=prometheus
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      12Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    prometheus-75dd748df4-wrwlr
Events:
  Type    Reason         Age   From                         Message
  ----    ------         ----  ----                         -------
  Normal  FailedBinding  37m   persistentvolume-controller  no persistent volumes available for this claim and no storage class is set



[vidya@KBS-VM-01 7-1_prometheus]$ kubectl logs prometheus-75dd748df4-zlncv -n monitoring
level=info ts=2020-04-28T07:49:07.885529914Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.3, branch=HEAD, revision=167a4b4e73a8eca8df648d2d2043e21bdb9a7449)"
level=info ts=2020-04-28T07:49:07.885635014Z caller=main.go:239 build_context="(go=go1.11.1, user=root@1e42b46043e9, date=20181004-08:42:02)"
level=info ts=2020-04-28T07:49:07.885812014Z caller=main.go:240 host_details="(Linux 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 prometheus-75dd748df4-zlncv (none))"
level=info ts=2020-04-28T07:49:07.885833214Z caller=main.go:241 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-04-28T07:49:07.885849614Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2020-04-28T07:49:07.888695413Z caller=main.go:554 msg="Starting TSDB ..."
level=info ts=2020-04-28T07:49:07.889017612Z caller=main.go:423 msg="Stopping scrape discovery manager..."
level=info ts=2020-04-28T07:49:07.889033512Z caller=main.go:437 msg="Stopping notify discovery manager..."
level=info ts=2020-04-28T07:49:07.889041112Z caller=main.go:459 msg="Stopping scrape manager..."
level=info ts=2020-04-28T07:49:07.889048812Z caller=main.go:433 msg="Notify discovery manager stopped"
level=info ts=2020-04-28T07:49:07.889071612Z caller=main.go:419 msg="Scrape discovery manager stopped"
level=info ts=2020-04-28T07:49:07.889083112Z caller=main.go:453 msg="Scrape manager stopped"
level=info ts=2020-04-28T07:49:07.889098012Z caller=manager.go:638 component="rule manager" msg="Stopping rule manager..."
level=info ts=2020-04-28T07:49:07.889109912Z caller=manager.go:644 component="rule manager" msg="Rule manager stopped"
level=info ts=2020-04-28T07:49:07.889124912Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
level=info ts=2020-04-28T07:49:07.889137812Z caller=main.go:608 msg="Notifier manager stopped"
level=info ts=2020-04-28T07:49:07.889169012Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=error ts=2020-04-28T07:49:07.889653412Z caller=main.go:617 err="opening storage failed: lock DB directory: open /data/lock: permission denied"

The problem here is that the PVC does not bind to the PV, primarily because there is no storage class linking the PV with the PVC, and because the capacity in the PV (12Gi) and the request in the PVC (10Gi) do not match. As a result, Kubernetes could not figure out which PV the PVC should be bound to.

  1. Add storageClassName: manual in the spec of both the PV and the PVC.
  2. Make the capacity in the PV and the request in the PVC the same, i.e. 10Gi.

PV

apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-storage
  namespace: monitoring
  labels:
    app: prometheus
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data"

PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-storage-claim
  namespace: monitoring
  labels:
    app: prometheus
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
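
After applying the updated manifests, the binding can be verified with standard kubectl commands (resource names taken from the manifests above):

kubectl get pv prometheus-storage
kubectl get pvc prometheus-storage-claim -n monitoring

Both should show a STATUS of Bound once the storageClassName and capacity match.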

Update:

Running the pod as root by adding runAsUser: 0 should solve the open /data/lock: permission denied error.
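
For example, a minimal sketch of the change in the Deployment's pod template (spec.template.spec), keeping the existing fsGroup:

securityContext:
  runAsUser: 0   # run the container as root so Prometheus can create /data/lock on the hostPath volume
  fsGroup: 1000

Note that fsGroup alone does not fix this, because Kubernetes does not change ownership or permissions of hostPath volumes.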
