
K8s did not kill my airflow webserver pod

I have airflow running in k8s containers.

The webserver encountered a DNS error (it could not resolve my database's hostname to an IP) and the webserver's gunicorn workers were killed.

What is troubling me is that k8s did not attempt to kill the pod and start a new one in its place.

Pod log output:

OperationalError: (psycopg2.OperationalError) could not translate host name "my.dbs.url" to address: Temporary failure in name resolution
[2017-12-01 06:06:05 +0000] [2202] [INFO] Worker exiting (pid: 2202)
[2017-12-01 06:06:05 +0000] [2186] [INFO] Worker exiting (pid: 2186)
[2017-12-01 06:06:05 +0000] [2190] [INFO] Worker exiting (pid: 2190)
[2017-12-01 06:06:05 +0000] [2194] [INFO] Worker exiting (pid: 2194)
[2017-12-01 06:06:05 +0000] [2198] [INFO] Worker exiting (pid: 2198)
[2017-12-01 06:06:06 +0000] [13] [INFO] Shutting down: Master
[2017-12-01 06:06:06 +0000] [13] [INFO] Reason: Worker failed to boot.

The pod's k8s status is RUNNING, but when I open an exec shell in the k8s UI I get the following output (gunicorn appears to realize it's dead):

root@webserver-373771664-3h4v9:/# ps -Al
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
4 S     0     1     0  0  80   0 - 107153 -     ?        00:06:42 /usr/local/bin/
4 Z     0    13     1  0  80   0 -     0 -      ?        00:01:24 gunicorn: maste <defunct>
4 S     0  2206     0  0  80   0 -  4987 -      ?        00:00:00 bash
0 R     0  2224  2206  0  80   0 -  7486 -      ?        00:00:00 ps

The following is the YAML for my deployment:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: webserver
  namespace: airflow
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: airflow-webserver
    spec:
      volumes:
      - name: webserver-dags
        emptyDir: {}
      containers:
      - name: airflow-webserver
        image: my.custom.image:latest
        imagePullPolicy: Always
        resources:
          requests:
            cpu: 100m
          limits:
            cpu: 500m
        ports:
        - containerPort: 80
          protocol: TCP
        env:
        - name: AIRFLOW_HOME
          value: /var/lib/airflow
        - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
          valueFrom:
            secretKeyRef:
              name: db1
              key: sqlalchemy_conn
        volumeMounts:
        - mountPath: /var/lib/airflow/dags/
          name: webserver-dags
        command: ["airflow"]
        args: ["webserver"]
      - name: docker-s3-to-backup
        image: my.custom.image:latest
        imagePullPolicy: Always
        resources:
          requests:
            cpu: 50m
          limits:
            cpu: 500m
        env:
        - name: ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws
              key: access_key_id
        - name: SECRET_KEY
          valueFrom:
            secretKeyRef:
              name: aws
              key: secret_access_key
        - name: S3_PATH
          value: s3://my-s3-bucket/dags/
        - name: DATA_PATH
          value: /dags/
        - name: CRON_SCHEDULE
          value: "*/5 * * * *"
        volumeMounts:
        - mountPath: /dags/
          name: webserver-dags
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: webserver
  namespace: airflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: webserver
  minReplicas: 2
  maxReplicas: 20
  targetCPUUtilizationPercentage: 75
---
apiVersion: v1
kind: Service
metadata:
  labels:
  name: webserver
  namespace: airflow
spec:
  type: NodePort
  ports:
  - port: 80
  selector:
    app: airflow-webserver

You need to define readiness and liveness probes so that Kubernetes can detect the pod status, as documented here: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-tcp-liveness-probe

For example:

ports:
- containerPort: 8080
readinessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
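
As a usage note, the probe port has to match what the webserver actually listens on. In the deployment above containerPort is 80, not 8080, so a sketch of the airflow-webserver container with TCP probes adapted to that port (assuming the webserver really binds to 80) would look like:

      - name: airflow-webserver
        image: my.custom.image:latest
        ports:
        - containerPort: 80
          protocol: TCP
        # The readiness probe keeps the pod out of the Service until the
        # webserver accepts connections; the liveness probe lets the kubelet
        # restart the container once it stops accepting TCP connections on 80.
        readinessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 15
          periodSeconds: 20
        # ... remaining fields (resources, env, volumeMounts, command, args) unchanged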

Well, when the main process in a container dies, the container exits and the kubelet restarts it on the same node, within the same pod. What happened here is by no means a fault of Kubernetes, but in fact a problem with your container. The main process that you launch in the container (be it from CMD or via ENTRYPOINT) needs to die for the above to happen, and the ones you launched did not: one went into zombie mode but was never reaped, which is an example of another issue altogether, zombie reaping. A liveness probe will help in this case (as mentioned by @sfgroups), as it will get the container restarted when the probe fails, but this is treating symptoms rather than the root cause (not that you shouldn't have probes defined in general, as a good practice).
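
To make the restart behaviour concrete, here is a minimal, hypothetical pod (not your Airflow deployment) whose main process does exit. The kubelet then restarts the container and the RESTARTS count climbs, which is exactly what did not happen in your case because PID 1 in the webserver container stayed alive:

apiVersion: v1
kind: Pod
metadata:
  name: exit-demo
spec:
  restartPolicy: Always          # the default; Deployments always use Always
  containers:
  - name: dies-quickly
    image: busybox
    # This shell is the container's main process (PID 1). When it exits,
    # the kubelet treats the container as terminated and restarts it, so
    # `kubectl get pod exit-demo` shows an increasing RESTARTS count.
    command: ["sh", "-c", "echo working; sleep 10; exit 1"]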
