I have Airflow running in Kubernetes (k8s) containers.
The webserver hit a DNS error (it could not resolve my database's hostname to an IP address) and the webserver workers were killed.
What troubles me is that k8s did not attempt to kill the pod and start a new one in its place.
Pod log output:
OperationalError: (psycopg2.OperationalError) could not translate host name "my.dbs.url" to address: Temporary failure in name resolution
[2017-12-01 06:06:05 +0000] [2202] [INFO] Worker exiting (pid: 2202)
[2017-12-01 06:06:05 +0000] [2186] [INFO] Worker exiting (pid: 2186)
[2017-12-01 06:06:05 +0000] [2190] [INFO] Worker exiting (pid: 2190)
[2017-12-01 06:06:05 +0000] [2194] [INFO] Worker exiting (pid: 2194)
[2017-12-01 06:06:05 +0000] [2198] [INFO] Worker exiting (pid: 2198)
[2017-12-01 06:06:06 +0000] [13] [INFO] Shutting down: Master
[2017-12-01 06:06:06 +0000] [13] [INFO] Reason: Worker failed to boot.
The k8s status is RUNNING, but when I open an exec shell in the k8s UI I get the following output (gunicorn appears to realize it's dead):
root@webserver-373771664-3h4v9:/# ps -Al
F S   UID   PID  PPID  C PRI  NI ADDR     SZ WCHAN  TTY          TIME CMD
4 S     0     1     0  0  80   0 -    107153 -      ?        00:06:42 /usr/local/bin/
4 Z     0    13     1  0  80   0 -         0 -      ?        00:01:24 gunicorn: maste <defunct>
4 S     0  2206     0  0  80   0 -      4987 -      ?        00:00:00 bash
0 R     0  2224  2206  0  80   0 -      7486 -      ?        00:00:00 ps
The following is the YAML for my deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: webserver
  namespace: airflow
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: airflow-webserver
    spec:
      volumes:
      - name: webserver-dags
        emptyDir: {}
      containers:
      - name: airflow-webserver
        image: my.custom.image:latest
        imagePullPolicy: Always
        resources:
          requests:
            cpu: 100m
          limits:
            cpu: 500m
        ports:
        - containerPort: 80
          protocol: TCP
        env:
        - name: AIRFLOW_HOME
          value: /var/lib/airflow
        - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
          valueFrom:
            secretKeyRef:
              name: db1
              key: sqlalchemy_conn
        volumeMounts:
        - mountPath: /var/lib/airflow/dags/
          name: webserver-dags
        command: ["airflow"]
        args: ["webserver"]
      - name: docker-s3-to-backup
        image: my.custom.image:latest
        imagePullPolicy: Always
        resources:
          requests:
            cpu: 50m
          limits:
            cpu: 500m
        env:
        - name: ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws
              key: access_key_id
        - name: SECRET_KEY
          valueFrom:
            secretKeyRef:
              name: aws
              key: secret_access_key
        - name: S3_PATH
          value: s3://my-s3-bucket/dags/
        - name: DATA_PATH
          value: /dags/
        - name: CRON_SCHEDULE
          value: "*/5 * * * *"
        volumeMounts:
        - mountPath: /dags/
          name: webserver-dags
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: webserver
  namespace: airflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: webserver
  minReplicas: 2
  maxReplicas: 20
  targetCPUUtilizationPercentage: 75
---
apiVersion: v1
kind: Service
metadata:
  labels:
  name: webserver
  namespace: airflow
spec:
  type: NodePort
  ports:
  - port: 80
  selector:
    app: airflow-webserver
You need to define readiness and liveness probes so that Kubernetes can detect the pod's status, as documented on this page: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-tcp-liveness-probe
    - containerPort: 8080
    readinessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
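To get a feel for what a tcpSocket probe checks, here is a rough Python sketch that tries to open a TCP connection and reports whether it succeeded. It only approximates the idea and is not the kubelet's actual implementation; the function name `tcp_probe` and the one-second default timeout are my own assumptions. It can be handy for testing from an exec shell whether anything is still listening on the port.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper: it mimics the idea behind a tcpSocket probe,
    which passes when the port accepts a connection.
    """
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing is accepting connections on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With the gunicorn master dead, nothing would be listening on the webserver port, so a check like `tcp_probe("127.0.0.1", 8080)` should return False, which is the condition the liveness probe reacts to.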
When the main process in a container dies, the container exits and the kubelet restarts it on the same node, within the same pod. What happened here is not a fault of Kubernetes but a problem with your container. For the restart to happen, the main process that you launch in the container (whether from CMD or via ENTRYPOINT) needs to die, and yours did not: the gunicorn master became a zombie but was never reaped, which is an example of a separate issue altogether, zombie reaping. A liveness probe will help in this case (as mentioned by @sfgroups), since the kubelet restarts the container when the probe fails, but that treats the symptom rather than the root cause (which is not to say you shouldn't define probes anyway as good practice).
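To illustrate the root-cause side, here is a minimal sketch of a wrapper that could run as the container's main process: it starts the real command as a child, reaps any child that exits, and returns the main child's exit code so that the wrapper (and therefore the container) can exit with it. The name `supervise` is hypothetical, `os.waitstatus_to_exitcode` requires Python 3.9+, and this is not how Airflow or gunicorn handle it internally. It also forwards no signals, so treat it as an illustration rather than a replacement for a proper init process.

```python
import os
import sys


def supervise(cmd):
    """Run cmd as a child process, reap exited children, return cmd's exit code.

    Hypothetical helper sketching what an init-like PID 1 does.
    """
    # Start the main child without waiting for it; returns its PID (Unix only).
    main_pid = os.spawnvp(os.P_NOWAIT, cmd[0], cmd)
    while True:
        try:
            # Block until any child exits; collecting its status reaps it,
            # so it does not linger as a <defunct> zombie.
            pid, status = os.wait()
        except ChildProcessError:
            # No children left to wait for.
            return 1
        if pid == main_pid:
            # Negative values mean the child was killed by a signal.
            return os.waitstatus_to_exitcode(status)


# Example use as an entrypoint (assumed command, taken from the question):
# sys.exit(supervise(["airflow", "webserver"]))
```

Because the wrapper exits as soon as the main child does, the kubelet sees the container terminate and restarts it, with no probe needed for this particular failure.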