
Node.js performance optimization involving HTTP calls

I have a Node.js application that opens a file, scans each line, and makes a REST call (which involves Couchbase) for each line. The average number of lines in a file is about 12 to 13 million. Currently, without any special settings, my app can completely process ~1 million records in ~24 minutes. I went through a lot of questions, articles, and Node docs but couldn't find any information about the following:

  1. Where is the setting that says Node can open X HTTP connections/sockets concurrently, and can I change it?
  2. I had to regulate the file processing because reading the file is much faster than the REST calls, so after a while there are too many open REST requests, which clogs the system and makes it run out of memory. Right now I read 1,000 lines, wait for the REST calls for those lines to finish, and then resume (I am doing this with the pause and resume methods on the stream). Is there a better alternative to this?
  3. What other optimizations can I perform to make it faster? I already know about the GC-related configuration that prevents frequent halts in the app.
  4. Is using the "cluster" module recommended? Does it work seamlessly?

Background: We have an existing Java application that does exactly the same thing by spawning 100 threads, and it achieves slightly better throughput than the current Node counterpart. But I want to try Node, since the two operations in question (reading a file and making a REST call for each line) seem like a perfect fit for a Node app: both can be async in Node, whereas the Java app makes blocking calls for them.

Any help would be greatly appreciated...

Generally you should break your questions on Stack Overflow into pieces. Since your questions are all getting at the same thing, I will answer them. First, let me start with the bottom:

We have an existing Java application that does exactly the same by spawning 100 threads ... But I want to try Node since the two operations in question ... seem like a perfect fit for a Node app since they both can be async in Node, whereas the Java app makes blocking calls for these.

Asynchronous calls and blocking calls are just tools to help you control flow and workload. Your Java app uses 100 threads, and therefore has the potential to do 100 things at a time. Your Node.js app may have the potential to do 1,000 things at a time, but some operations will run in JavaScript on a single thread and other I/O work will pull from a thread pool. In any case, none of this matters if the backend system you're calling can only handle 20 things at a time. If your system is 100% utilized, changing the way you do your work certainly won't speed it up.

In short, making something asynchronous is not a tool for speed, it is a tool for managing the workload.
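As a rough illustration of "managing the workload" (a sketch only; makeRestCall is a placeholder for your Couchbase REST call, and the limit is something you would tune against what the backend can actually handle), you can cap the number of in-flight requests:

    var MAX_IN_FLIGHT = 100; // tune this to what the backend can handle
    var inFlight = 0;
    var queue = [];

    function submit(line) {
      if (inFlight >= MAX_IN_FLIGHT) {
        queue.push(line); // hold the work until a slot frees up
        return;
      }
      inFlight++;
      makeRestCall(line, function () { // placeholder for your REST/Couchbase call
        inFlight--;
        if (queue.length > 0) submit(queue.shift());
      });
    }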

Where is the setting that says Node can open X HTTP connections/sockets concurrently, and can I change it?

Node.js's HTTP client automatically uses an agent, allowing you to take advantage of keep-alive connections. It also means that you won't flood a single host unless you write code to do so. http.globalAgent.maxSockets = 1000 is what you want, as mentioned in the documentation: http://nodejs.org/api/http.html#http_agent_maxsockets

I had to regulate the file processing because reading the file is much faster than the REST calls, so after a while there are too many open REST requests, which clogs the system and makes it run out of memory. Right now I read 1,000 lines, wait for the REST calls for those lines to finish, and then resume (I am doing this with the pause and resume methods on the stream). Is there a better alternative to this?

Don't use .on('data') for your stream; use .on('readable') and only read from the stream when you're ready. I also suggest using a transform stream to read line by line.
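A minimal sketch of that approach, assuming a hypothetical makeRestCall(line) for the Couchbase REST request and a naive line-splitting transform:

    var fs = require('fs');
    var stream = require('stream');

    // Naive transform stream that re-emits its input one line at a time.
    var liner = new stream.Transform({ objectMode: true });
    liner._transform = function (chunk, encoding, done) {
      var data = (this._leftover || '') + chunk.toString();
      var lines = data.split('\n');
      this._leftover = lines.pop(); // keep the trailing partial line for the next chunk
      for (var i = 0; i < lines.length; i++) this.push(lines[i]);
      done();
    };
    liner._flush = function (done) {
      if (this._leftover) this.push(this._leftover);
      this._leftover = null;
      done();
    };

    fs.createReadStream('input.txt').pipe(liner);

    // Pull lines only when you are ready for more work.
    liner.on('readable', function () {
      var line;
      while ((line = liner.read()) !== null) {
        makeRestCall(line); // placeholder; stop calling read() if too much work is already pending
      }
    });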

What other optimizations can I perform to make it faster? I already know about the GC-related configuration that prevents frequent halts in the app.

This is impossible to answer without detailed analysis of your code. Read more about Node.js and how its internals work. If you spend some time on this, the optimizations that are right for you will become clear.

Is using "cluster" module recommended? Does it work seamlessly?

This is only needed if you are unable to fully utilize your hardware. It isn't clear what you mean by "seamlessly", but each process is its own process as far as the OS is concerned, so it isn't something I would call "seamless".
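For reference, a minimal cluster sketch (how the input file gets split between workers is up to you; processMySlice is a hypothetical function):

    var cluster = require('cluster');
    var os = require('os');

    if (cluster.isMaster) {
      // Fork one worker per CPU core; each worker is a separate OS process.
      for (var i = 0; i < os.cpus().length; i++) {
        cluster.fork();
      }
      cluster.on('exit', function (worker) {
        console.log('worker ' + worker.process.pid + ' exited');
      });
    } else {
      // Each worker runs independently, e.g. processing its own slice of the file.
      processMySlice(cluster.worker.id); // hypothetical
    }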

By default, Node uses a socket pool for all HTTP requests, and the default global limit is 5 concurrent connections per host (these are re-used for keep-alive connections, however). There are a few ways around this limit:

  1. Create your own http.Agent and specify it in your HTTP requests:

     var agent = new http.Agent({ maxSockets: 1000 });

     http.request({
       // ...
       agent: agent
     }, function (res) { });
  2. Change the global/default http.Agent limit:

     http.globalAgent.maxSockets = 1000; 
  3. Disable pooling/connection re-use entirely for a request:

     http.request({
       // ...
       agent: false
     }, function (res) { });
