How to get zero downtime with Socket.io / Node.js server?

Question

I have a Node.js web server running with Socket.io. I found that if a single error happens in the script, the entire server crashes. So I'm trying to find a way to keep the server up and running in cases like this once the app goes into production. I found one answer that seemed promising, but it didn't solve my particular problem when I tried implementing it in my code: "How do I prevent node.js from crashing? try-catch doesn't work"

EDIT:

What I fixed so far: I now have PM2 auto-restarting the script upon crash, and I have Redis set up with my user session data stored in it.

My code is currently set up like this:

EDIT #2: After studying and working on the code all day, I edited it slightly a second time to include "sticky-session" logic. After that edit, the strange socket connections every 1 second are gone, and it seems (I'm not completely sure, though) that the sockets are all in sync with the workers. When the script crashes, the app (not PM2) spawns a new process, which seems good. However, when a worker crashes, users still have to refresh the page to renew their session and get new sockets, which is a big problem...

var fs = require('fs'),
  https = require('https'),
  express = require('express'),
  options = {
    key: fs.readFileSync('/path/to/privkey.pem'),
    cert: fs.readFileSync('/path/to/fullchain.pem')
  },
  cluster = require('cluster'), // not really sure how to use this
  net = require('net'), // not really sure what to do here
  socketio = require('socket.io'),
  io_redis = require('socket.io-redis'), // not really sure how to use this
  sticky = require('sticky-session');

var app = express();

// Debug route: shows which worker answered the request
app.get('/', function(req, res) {
  res.end('worker: ' + cluster.worker.id);
});

// https.createServer takes the options and a single request listener (the Express app)
var server = https.createServer(options, app);

if (!sticky.listen(server, 3000)) {
  // Master code
  // sticky-session forks the workers itself, so there is no cluster.fork() loop here
  server.once('listening', function() {
    console.log('server started on port 3000');
  });
}
else {
  // Worker code
  var io = socketio(server),
    getUser = require('./lib/getUser'),
    loginUser = require('./lib/loginUser'),
    authenticateUser = require('./lib/authenticateUser'),
    client = require('./lib/redis'); // connect to redis

  // Share socket.io events between workers through Redis
  io.adapter(io_redis({host: 'localhost', port: 6379}));

  client.on("error", function(err) {
    console.log("Error " + err);
  });

  io.on('connection', function(socket) {
    // LOTS OF SOCKET EVENTS / REDIS USER SESSION MANAGEMENT / APP
  });

}

I tried using "cluster", but I'm not sure how to get it working properly, since it involves multiple "workers", and I believe the sockets get mixed up between them. I'm not even sure which parts of my code ("require" calls, etc.) go in which "cluster" code block (Master/Worker), or how to keep the sockets in sync. Something just isn't right.

I'm assuming I need to use the npm packages socket.io-redis and/or sticky-session to keep the sockets in sync? (I'm not sure how to implement this.) Unfortunately, there just aren't any good examples on the internet or in the books I'm reading for clustering socket.io with node.js.

Can someone provide a basic code example on which parts of my code go where, or how to implement things? I would greatly appreciate it. The goals are:

1) If the server (node cluster process) crashes, the sockets should still work after restart (or another worker spawns).

For example, if two users (two sockets) are having a private message conversation and then a crash happens, the messages should still be delivered after PM2 auto-restarts (spawns a new cluster process) after crash. The problem I have: If the server crashes, messages stop getting sent to users even after an auto-restart.

2) Sockets should all be in sync together with different cluster processes.

Answer 1

How to get zero downtime with …

You don't.

It's simply not possible with anything. You're asking the wrong questions. Try these:

How do I catch and handle errors I can predict?
How do I gracefully fail when there are errors I cannot predict?
How can I usefully separate errors in my application vs. errors in how clients interact with it?
How can I build a distributed system?
How do I deploy and scale a system with fault tolerance in mind?
I have [single point of failure XYZ], how do I distribute [XYZ] to remove it?
What systems monitoring is useful for [some technology]?
How do I set up automation for [recurring problem X]?

etc. etc.
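
To make the first two questions concrete, here is a minimal sketch, not a definitive implementation. It assumes plain Node.js only: `safeHandler` is a hypothetical helper name, the 'private message' event and its payload shape are made up for illustration, and an `EventEmitter` stands in for a real Socket.io socket. The idea is that an error thrown inside one event handler is reported and contained instead of taking the whole process down.

```javascript
'use strict';
const { EventEmitter } = require('events');

// Hypothetical helper: wraps an event handler so that a thrown error
// (or a rejected promise) is passed to onError instead of propagating
// up and crashing the process.
function safeHandler(handler, onError) {
  return function wrapped(...args) {
    try {
      const result = handler.apply(this, args);
      // Also cover async handlers that return a promise.
      if (result && typeof result.catch === 'function') {
        result.catch((err) => onError(err, args));
      }
    } catch (err) {
      onError(err, args);
    }
  };
}

// Stand-in for a socket (assumption: only on/emit are needed here).
const socket = new EventEmitter();
const delivered = [];
const errors = [];

socket.on('private message', safeHandler(function (msg) {
  // A predictable error: the client sent a malformed message.
  if (!msg || typeof msg.to !== 'string') {
    throw new TypeError('message has no recipient');
  }
  delivered.push(msg);
}, function (err) {
  // Fail gracefully: record the problem and keep serving other events.
  errors.push(err.message);
}));

socket.emit('private message', {});                          // malformed, error is captured
socket.emit('private message', { to: 'bob', text: 'hi' });   // still handled afterwards

console.log('errors:', errors);
console.log('delivered:', delivered.length);
```

Errors that escape every handler are a different matter: after one of those the process state is unknown, so letting the worker exit and having a supervisor such as PM2 start a fresh one is usually safer than trying to carry on.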
