We are regularly seeing connection refused errors on a bespoke NGINX reverse proxy running in AWS EKS (Kubernetes template below).
Initially we thought it was an issue with the load balancer, but on further investigation the problem appears to sit between kube-proxy and the nginx Pod.
When I run repeated wget requests against a node's internal IP and the NodePort serving the proxy, I see 400 Bad Request several times and then, eventually, a failed: Connection refused.
Whereas when I run the same requests directly against the Pod IP and port, I never get a connection refused.
Example wget output
Fail:
wget ip.ap-southeast-2.compute.internal:30102
--2020-06-26 01:15:31-- http://ip.ap-southeast-2.compute.internal:30102/
Resolving ip.ap-southeast-2.compute.internal (ip.ap-southeast-2.compute.internal)... 10.1.95.3
Connecting to ip.ap-southeast-2.compute.internal (ip.ap-southeast-2.compute.internal)|10.1.95.3|:30102... failed: Connection refused.
Success:
wget ip.ap-southeast-2.compute.internal:30102
--2020-06-26 01:15:31-- http://ip.ap-southeast-2.compute.internal:30102/
Resolving ip.ap-southeast-2.compute.internal (ip.ap-southeast-2.compute.internal)... 10.1.95.3
Connecting to ip.ap-southeast-2.compute.internal (ip.ap-southeast-2.compute.internal)|10.1.95.3|:30102... connected.
HTTP request sent, awaiting response... 400 Bad Request
2020-06-26 01:15:31 ERROR 400: Bad Request.
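For completeness, the repeated probe described above can be scripted so the failure rate is measurable. This is a sketch, not from the original post: it uses bash's /dev/tcp redirection in place of wget so it has no external dependencies, and probe_port is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: probe HOST:PORT repeatedly and count how often the TCP
# connect succeeds vs. is refused. Uses bash's /dev/tcp feature, so
# it needs no wget/curl on the box.
probe_port() {
  # probe_port HOST PORT TRIES -> prints "ok=N refused=M"
  local host=$1 port=$2 tries=$3 ok=0 refused=0
  for _ in $(seq 1 "$tries"); do
    # Open (and immediately close) a TCP connection in a subshell.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      ok=$((ok + 1))
    else
      refused=$((refused + 1))
    fi
  done
  echo "ok=${ok} refused=${refused}"
}

# Against the cluster in the question this would be, e.g.:
#   probe_port 10.1.95.3 30102 50
```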
In the nginx logs we see no trace of the refused requests, whereas the 400 Bad Request ones do appear, which suggests the refused connections never reach nginx at all.
I have read about several kube-proxy issues that look related, e.g.:
https://github.com/kubernetes/kubernetes/issues/38456
and I am interested in any insights that might improve this situation.
Any help much appreciated.
Kubernetes Template
##
# Main nginx deployment. Requires updated tag potentially for
# docker image
##
---
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-lua-ssl-deployment
  labels:
    service: https-custom-domains
spec:
  selector:
    matchLabels:
      app: nginx-lua-ssl
  replicas: 5
  template:
    metadata:
      labels:
        app: nginx-lua-ssl
        service: https-custom-domains
    spec:
      containers:
        - name: nginx-lua-ssl
          image: "0000000000.dkr.ecr.ap-southeast-2.amazonaws.com/lua-resty-auto-ssl:v0.NN"
          imagePullPolicy: Always
          ports:
            - containerPort: 8080
            - containerPort: 8443
            - containerPort: 8999
          envFrom:
            - configMapRef:
                name: https-custom-domain-conf
##
# Load balancer which manages traffic into the nginx instance
# In aws, this uses an ELB (elastic load balancer) construct
##
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
  name: nginx-lua-load-balancer
  labels:
    service: https-custom-domains
spec:
  ports:
    - name: http
      port: 80
      targetPort: 8080
    - name: https
      port: 443
      targetPort: 8443
  externalTrafficPolicy: Local
  selector:
    app: nginx-lua-ssl
  type: LoadBalancer
It's a tricky one because it could be at any layer of your stack.
A couple of pointers:
Check the logs of the kube-proxy pod running on the node in question:
$ kubectl logs -n kube-system <kube-proxy-pod>
or SSH to the node and run:
$ docker logs <kube-proxy-container>
You can also try to change the verbosity of the kube-proxy logs in the kube-proxy DaemonSet:
containers:
  - command:
      - /bin/sh
      - -c
      - kube-proxy --v=9 --config=/var/lib/kube-proxy-config/config --hostname-override=${NODE_NAME}
    env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: spec.nodeName
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.15.10
    imagePullPolicy: IfNotPresent
    name: kube-proxy
Does kube-proxy have enough resources on the node it's running on? You can try changing the kube-proxy DaemonSet to give it more resources (CPU, memory):
containers:
  - command:
      - /bin/sh
      - -c
      - kube-proxy --v=2 --config=/var/lib/kube-proxy-config/config --hostname-override=${NODE_NAME}
    env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: spec.nodeName
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.15.10
    imagePullPolicy: IfNotPresent
    name: kube-proxy
    resources:
      requests:
        cpu: 300m   # <== this instead of 100m
You can try enabling iptables logging on the node. Check if packets are getting dropped for some reason.
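A sketch of what that could look like, assuming an iptables-based kube-proxy and the NodePort 30102 from the question (the log-prefix string is made up here, and the kernel log location varies by distro):

```shell
# On the node: log incoming SYNs to the NodePort in the raw table's
# PREROUTING chain, i.e. before kube-proxy's NAT rules rewrite them.
sudo iptables -t raw -I PREROUTING -p tcp --dport 30102 --syn \
  -j LOG --log-prefix "nodeport-30102: "

# Watch the kernel log while reproducing the failure:
sudo journalctl -kf | grep nodeport-30102

# Remove the rule when done:
sudo iptables -t raw -D PREROUTING -p tcp --dport 30102 --syn \
  -j LOG --log-prefix "nodeport-30102: "
```

If the SYN shows up in the log but the client still sees connection refused, the packet arrived at the node and something in the iptables/conntrack path rejected it.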
In the end this issue was caused by a Pod that was configured incorrectly, such that the load balancer routed traffic to it:
selector:
  matchLabels:
    app: redis-cli
There were five nginx Pods correctly receiving traffic, and one utility Pod incorrectly receiving traffic and refusing the connections, as you would expect.
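A quick way to catch this class of problem is to compare the Pods the Service selector matches with the endpoints kube-proxy will actually route to. These are hypothetical diagnostic commands against the Service and labels defined in the template above:

```shell
# Pods carrying the label the Service selects on; anything unexpected
# here (like a stray utility pod) will receive load balancer traffic.
kubectl get pods -l app=nginx-lua-ssl -o wide

# Endpoints registered for the Service; should list only the nginx pods.
kubectl get endpoints nginx-lua-load-balancer

# Show labels alongside each matching pod to spot the misconfigured one.
kubectl get pods -l app=nginx-lua-ssl --show-labels
```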
Thanks for responses.