
Infrequent NGINX Errors/Timeouts when proxying to a Node/Express upstream (upstream timed out while connecting to upstream)

We have an Express web API running on an EC2 Ubuntu server behind an NGINX reverse proxy. It receives a decent but still relatively small amount of traffic (around 10 requests per second on average). Every so often a request will hang, and if the client waits long enough, a line containing the below is written to the NGINX error log:

upstream timed out (110: Connection timed out) while connecting to upstream

I've already tried the suggested solution here, but it seems to have had no effect. As far as we can tell this only happens about 1-3 times every minute, but I'm only going off of these logs: if the client refreshes their page or navigates away before the request times out, there seems to be no record of it.

The error message obviously indicates that something is going wrong when connecting to the upstream server, but why would this happen so infrequently? There is also absolutely no pattern in the URLs causing the issue, and as far as I can tell the application being proxied to remains available. Here's an idea of our NGINX config:

user www-data;
worker_processes 4;
pid /run/nginx.pid;
events {
worker_connections 10000;
}
worker_rlimit_nofile 25000;

http {
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
types_hash_max_size 2048;
client_max_body_size 15M;
include /etc/nginx/mime.types;
include /etc/nginx/conf.d/*.conf;
include /etc/nginx/sites-enabled/*;
default_type application/octet-stream;

log_format nginx_json '{ "timestamp": "$time_local", '
                      ' "request_ip": "$remote_addr", '
                      ' "request_user": "$remote_user", '
                      ' "request_bytes_sent": "$bytes_sent", '
                      ' "response_status": "$status", '
                      ' "request": "$request", '
                      ' "request_method": "$request_method", '
                      ' "http_referrer": "$http_referer", '
                      ' "http_user_agent": "$http_user_agent", '
                      ' "request_id": "$request_id", '
                      ' "server_name": "$server_name",'
                      ' "response_time": "$upstream_response_time" }';

access_log /var/log/nginx/access.log nginx_json;
error_log /var/log/nginx/error.log;

gzip on;
gzip_disable "msie6";
ssl_prefer_server_ciphers on;
ssl_session_cache shared:SSL:10m;
ssl_ciphers "EECDH+AESGCM:EDH+AESGCM:ECDHE-RSA-AES128-GCM-SHA256:AES256+EECDH:DHE-RSA-AES128-GCM-SHA256:AES256+EDH:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA:ECDHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES256-GCM-SHA384:AES128-GCM-SHA256:AES256-SHA256:AES128-SHA256:AES256-SHA:AES128-SHA:DES-CBC3-SHA:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!MD5:!PSK:!RC4";
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_dhparam /etc/ssl/certs/dhparam.pem;

resolver 127.0.0.1 valid=30s;

server {
 listen 80;
 server_name a.mysite.com;
 return 301 https://$server_name$request_uri;
}

 server {
     listen 443 ssl;
     server_name a.mysite.com;
     add_header Strict-Transport-Security "max-age=31536000";
     add_header Cache-Control no-cache;
     location /api {
         proxy_pass http://1.2.3.4:3001;
         proxy_set_header Host $host;
         proxy_set_header X-Request-Id $request_id;
         proxy_set_header Connection "";
         proxy_http_version 1.1;
     }
     location /ui2 {
         set $uiHost https://abc.cloudfront.net/ui2/index.html?v=1503438694163;
         proxy_pass $uiHost;
     }
     location / {
         set $uiHost https://abc.cloudfront.net/ui/index.html?v=1504012942606;
         proxy_pass $uiHost;
     }
     ssl_certificate /path/to/certificate;
     ssl_certificate_key /path/to/certificate/key;
 }

The server blocks at the bottom are repeated for multiple subdomains, with the /api path usually pointing to the same server on different ports; one subdomain gets the bulk of the traffic. The upstream server (1.2.3.4 in the example) is configured with EC2 security groups to only allow access from the NGINX server. The error message suggests that something could be wrong with the upstream server running the Express app, but nothing in our logs indicates that this is happening.
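One thing worth noting about the /api block: with proxy_pass pointing directly at an IP and port, NGINX opens a new TCP connection to the Express app for every request, and the proxy_http_version 1.1 / empty Connection header pair only enables connection reuse when proxy_pass targets an upstream group that declares a keepalive cache. A minimal sketch of that variant, assuming the same backend address (the group name express_api and the keepalive size of 32 are placeholders, not part of the original config):

upstream express_api {
    server 1.2.3.4:3001;
    keepalive 32;   # idle connections kept open to the backend, per worker
}

server {
    # ... existing listen, server_name and ssl directives ...
    location /api {
        proxy_pass http://express_api;
        proxy_set_header Host $host;
        proxy_set_header X-Request-Id $request_id;
        # These two only result in connection reuse when the upstream
        # block above defines a keepalive cache:
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}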

A few final things to note:

  • I recently increased worker_connections from 768 to 10000, which seems to have made the issue occur slightly less frequently. We never get anywhere close to the connection limit, however, and connections are being closed.
  • After this increase, each time NGINX is reloaded we do not get any of these errors for around 10 minutes. This is the primary reason I think NGINX is the culprit, but I'm no expert.
  • I did see a previous post somewhere while Googling around suggesting that the proxy_set_header Host $host; statement could be causing this, which didn't make much sense to me, but it's something to think about. I have not tested removing it.
  • The API server running the Express app always appears to be performing normally, and that server is not under heavy load from what we can tell.
  • This issue does not occur on the proxies to CloudFront.

Anyone have anything obvious or ideas about what to investigate further? Could really use some help here as we're pretty lost.

Update: I added some extra variables to the logs as recommended and was able to tie the error to an access log entry. Here are the relevant variables:

{
    "offset": 64270628,
    "response_status": "504",
    "upstream_header_time": "60.001",
    "input_type": "log",
    "source": "/var/log/nginx/access.log",
    "request_method": "GET",
    "http_user_agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "@timestamp": "2017-08-30T15:29:15.981Z",
    "upstream_connect_time": "60.001",
    "request_user": "-",
    "response_time": "60.001",
    "request_bytes_sent": "345",
    "request_id": "90a41e2224cc4b2c1d3c23d544b9146c",
    "timestamp": "30/Aug/2017:15:29:15 +0000"
}
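The upstream_connect_time of 60.001 matches NGINX's default proxy_connect_timeout of 60 seconds, which suggests the request timed out while establishing the TCP connection to the Express app rather than while waiting for a response. The upstream timing fields above map to NGINX log variables; a minimal sketch of how the existing nginx_json format could be extended to capture them (abbreviated, other fields unchanged; some fields in the JSON, such as offset and @timestamp, appear to come from a log shipper rather than NGINX):

# Sketch: the two upstream timing variables added to the existing format
log_format nginx_json '{ "timestamp": "$time_local", '
                      ' "response_status": "$status", '
                      ' "request_id": "$request_id", '
                      ' "upstream_connect_time": "$upstream_connect_time", '
                      ' "upstream_header_time": "$upstream_header_time", '
                      ' "response_time": "$upstream_response_time" }';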

In my case, I had the exact same issue with an NGINX reverse proxy inside a private subnet and a NAT Gateway associated with it.

The problem was that the number of connections allocated to the NGINX instance, which was set on the NAT Gateway, was not enough and did not autoscale.

It was not an issue with NGINX or with the backend API behind NGINX.

In the case of GCP, this setting can be changed in the Cloud NAT console. For the specifics, please refer to this answer.
