Azure AKS node stops egress traffic to a specific IP

We have an application hosted on an Azure AKS (Kubernetes) cluster. It is basically a web application with a Java back end and an nginx container set up as a reverse proxy to direct HTTP traffic. The majority of traffic is routed to the back-end services, but we direct a couple of endpoints back to our on-premises instance of the application (using a public domain).

This setup worked very well for about a week under a fairly heavy traffic load, then abruptly stopped proxying traffic to our on-prem resources. We initially thought that someone had changed a firewall setting, but further testing revealed that the problem was isolated to the single node hosting the nginx proxy.

I was able to SSH into the node, and attempts to reach our on-prem server at its public HTTP address failed. However, I can access any other site on the internet, including sites we host at other IP addresses. If I SSH to another node, I can reach our on-prem-hosted sites without issue. It seems that our node is blocking, or is being blocked from, access to our site, but we can find no mechanism responsible. No firewall or configuration changes have taken place as far as we know. The Azure AKS documentation says there are no default limits on HTTP egress traffic. Has anyone come across this issue?
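
For anyone who wants the specifics, the tests above amount to roughly the following, run from the node over SSH (onprem.example.com stands in for our real on-prem domain):

    # Does the on-prem hostname resolve, and to the same address as on a good node?
    dig +short onprem.example.com

    # Attempt the failing request; -v shows where it stalls
    # (DNS lookup, TCP connect, or waiting on an HTTP response).
    curl -v --max-time 10 http://onprem.example.com/

    # Control test: an unrelated site that still works from this node.
    curl -v --max-time 10 https://www.microsoft.com/

    # See where packets on the failing path stop.
    traceroute onprem.example.com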

Here is a block from our nginx configuration which proxies requests to our local instance:

    # Proxy these endpoints back to the on-premises instance.
    location /civix/content/oic {
        # $on_prem_site is a variable (presumably set elsewhere in the
        # config); because proxy_pass uses a variable, nginx resolves
        # the upstream at request time rather than once at startup.
        proxy_pass $on_prem_site;
        proxy_set_header Host $server_name;
        proxy_set_header X-Forwarded-For $remote_addr;
        # Intercept upstream errors and serve nginx's error pages instead.
        proxy_intercept_errors on;
    }

Since you are able to connect to other sites from the misbehaving node, I'm going to assume that this is not a DNS resolution issue and that you are simply unable to connect to the on-prem application after a successful lookup. Any additional detail on how the connection to the on-prem app fails would be helpful.

For immediate feedback, try turning off the proxy_intercept_errors setting in nginx to see whether that surfaces more useful information.
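
A minimal sketch of that change against the location block from the question; with interception off, error responses that the upstream actually returns reach the client unmodified instead of being replaced by nginx's error pages, which helps distinguish an upstream that answers with an error from one that cannot be reached at all:

    location /civix/content/oic {
        proxy_pass $on_prem_site;
        proxy_set_header Host $server_name;
        proxy_set_header X-Forwarded-For $remote_addr;
        # Pass upstream error responses straight through while debugging.
        proxy_intercept_errors off;
    }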

Check whether the on-prem application is rate-limiting or blocking the egress IP address associated with the failing node. If you don't have access to the on-prem app, try moving the nginx proxy service to a new node, using node affinity to target a "good" node as sketched below (see https://docs.microsoft.com/en-us/azure/aks/operator-best-practices-advanced-scheduler#control-pod-scheduling-using-node-selectors-and-affinity).
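
A minimal sketch of that pinning, assuming the proxy runs as a Deployment; only the affinity stanza is shown, to be merged into your existing pod template, and both nginx-proxy and the node name are placeholders for your actual deployment and a node that can still reach the on-prem site:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-proxy                 # placeholder: your proxy's Deployment
    spec:
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: kubernetes.io/hostname
                        operator: In
                        values:
                          - aks-nodepool1-00000000-vmss000001   # placeholder: a known-good node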

Traffic will likely start flowing again; that will validate the theory while you troubleshoot whatever is blocking the failing node on the on-prem side.
