
K8s did not kill my airflow webserver pod

I have airflow running in k8s containers.

The webserver encountered a DNS error (it could not resolve my database URL to an IP address) and the webserver workers were killed.

What is troubling me is that k8s did not attempt to kill the pod and start a new one in its place.

Pod log output:

OperationalError: (psycopg2.OperationalError) could not translate host name "my.dbs.url" to address: Temporary failure in name resolution
[2017-12-01 06:06:05 +0000] [2202] [INFO] Worker exiting (pid: 2202)
[2017-12-01 06:06:05 +0000] [2186] [INFO] Worker exiting (pid: 2186)
[2017-12-01 06:06:05 +0000] [2190] [INFO] Worker exiting (pid: 2190)
[2017-12-01 06:06:05 +0000] [2194] [INFO] Worker exiting (pid: 2194)
[2017-12-01 06:06:05 +0000] [2198] [INFO] Worker exiting (pid: 2198)
[2017-12-01 06:06:06 +0000] [13] [INFO] Shutting down: Master
[2017-12-01 06:06:06 +0000] [13] [INFO] Reason: Worker failed to boot.

The k8s status is RUNNING, but when I open an exec shell in the k8s UI I get the following output (gunicorn appears to realize it's dead):

root@webserver-373771664-3h4v9:/# ps -Al
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
4 S     0     1     0  0  80   0 - 107153 -     ?        00:06:42 /usr/local/bin/
4 Z     0    13     1  0  80   0 -     0 -      ?        00:01:24 gunicorn: maste <defunct>
4 S     0  2206     0  0  80   0 -  4987 -      ?        00:00:00 bash
0 R     0  2224  2206  0  80   0 -  7486 -      ?        00:00:00 ps

The following is the YAML for my deployment:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: webserver
  namespace: airflow
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: airflow-webserver
    spec:
      volumes:
      - name: webserver-dags
        emptyDir: {}
      containers:
      - name: airflow-webserver
        image: my.custom.image:latest
        imagePullPolicy: Always
        resources:
          requests:
            cpu: 100m
          limits:
            cpu: 500m
        ports:
        - containerPort: 80
          protocol: TCP
        env:
        - name: AIRFLOW_HOME
          value: /var/lib/airflow
        - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
          valueFrom:
            secretKeyRef:
              name: db1
              key: sqlalchemy_conn
        volumeMounts:
        - mountPath: /var/lib/airflow/dags/
          name: webserver-dags
        command: ["airflow"]
        args: ["webserver"]
      - name: docker-s3-to-backup
        image: my.custom.image:latest
        imagePullPolicy: Always
        resources:
          requests:
            cpu: 50m
          limits:
            cpu: 500m
        env:
        - name: ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws
              key: access_key_id
        - name: SECRET_KEY
          valueFrom:
            secretKeyRef:
              name: aws
              key: secret_access_key
        - name: S3_PATH
          value: s3://my-s3-bucket/dags/
        - name: DATA_PATH
          value: /dags/
        - name: CRON_SCHEDULE
          value: "*/5 * * * *"
        volumeMounts:
        - mountPath: /dags/
          name: webserver-dags
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: webserver
  namespace: airflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: webserver
  minReplicas: 2
  maxReplicas: 20
  targetCPUUtilizationPercentage: 75
---
apiVersion: v1
kind: Service
metadata:
  name: webserver
  namespace: airflow
spec:
  type: NodePort
  ports:
  - port: 80
  selector:
    app: airflow-webserver

You need to define readiness and liveness probes so Kubernetes can detect the pod status, as documented here: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-tcp-liveness-probe

ports:
- containerPort: 8080
readinessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
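Note that the probe port must match the port the webserver actually binds: the Airflow webserver listens on 8080 by default, while the deployment above only declares containerPort 80. If your Airflow version exposes a /health endpoint, an HTTP probe is an alternative to the TCP check; a minimal sketch, assuming the webserver serves /health on port 8080:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30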

Well, when a process dies in a container, that container will exit and the kubelet will restart the container on the same node, within the same pod. What happened here is by no means a fault of Kubernetes, but in fact a problem of your container. The main process you launch in the container (be it from CMD or via ENTRYPOINT) needs to die for the above to happen, and the ones you launch did not: one went into zombie mode but was not reaped, which is an example of another issue altogether, zombie reaping. A liveness probe will help in this case (as mentioned by @sfgroups), since it will terminate the pod when the probe fails, but that treats the symptom rather than the root cause (not that you shouldn't have probes defined in general, as good practice). One common fix for the root cause is to run the container under a minimal init process that reaps children and exits when the main process dies, as sketched below.
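A minimal sketch of that approach, assuming tini is baked into your image (the binary path depends on how tini is installed; /usr/bin/tini is where Debian's package places it):

containers:
- name: airflow-webserver
  image: my.custom.image:latest
  # tini runs as PID 1: it forwards signals, reaps zombie children,
  # and exits when its child (the gunicorn master) exits, so the
  # container dies and the kubelet restarts it per restartPolicy.
  command: ["/usr/bin/tini", "--"]
  args: ["airflow", "webserver"]

With a PID 1 that reaps and exits, the defunct gunicorn master seen in the ps output above would have taken the container down, and Kubernetes would have restarted it.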
