
Node.js performance optimization involving HTTP calls

I have a Node.js application which opens a file, scans each line, and makes a REST call (involving Couchbase) for each line. The average number of lines in a file is about 12 to 13 million. Currently, without any special settings, my app can completely process ~1 million records in ~24 minutes. I went through a lot of questions, articles, and Node docs but couldn't find any information about the following:

  1. Where is the setting that says Node can open X number of HTTP connections/sockets concurrently, and can I change it?
  2. I had to regulate the file processing because reading the file is much faster than making the REST calls, so after a while there are too many open REST requests; this clogs the system and it runs out of memory. So now I read 1000 lines, wait for the REST calls for those to finish, and then resume (I am doing this using the pause and resume methods on the stream). Is there a better alternative to this?
  3. What possible optimizations can I perform to make this faster? I know about the GC-related config that prevents frequent halts in the app.
  4. Is using the "cluster" module recommended? Does it work seamlessly?

Background: We have an existing Java application that does exactly the same thing by spawning 100 threads, and it achieves slightly better throughput than the current Node counterpart. But I want to try Node, since the two operations in question (reading a file and making a REST call for each line) seem like a perfect fit for a Node app: both can be async in Node, whereas the Java app makes blocking calls for them.

Any help would be greatly appreciated...

Generally you should break your questions on Stack Overflow into separate pieces. Since your questions are all getting at the same thing, I will answer them here. First, let me start with the bottom one:

We have an existing java application that does exactly same by spawning 100 threads ... But I want to try node since the two operations in question ... seem like perfect situation for node app since they both can be async in node where as Java app makes blocking calls for these.

Asynchronous calls and blocking calls are just tools to help you control flow and workload. Your Java app uses 100 threads, and therefore can have 100 things in flight at a time. Your Node.js app may be able to have 1,000 things in flight at a time, but some operations will be done in JavaScript on a single thread while other IO work pulls from a thread pool. In any case, none of this matters if the backend system you're calling can only handle 20 things at a time. If that system is 100% utilized, changing the way you do your work certainly won't speed it up.

In short, making something asynchronous is not a tool for speed; it is a tool for managing workload.

Where's the setting that says node can open X number of http connections / sockets concurrently? and can I change it?

Node.js' HTTP client automatically has an agent, allowing you to utilize keep-alive connections. It also means that you won't flood a single host unless you write code to do so. `http.globalAgent.maxSockets = 1000` is what you want, as mentioned in the documentation: http://nodejs.org/api/http.html#http_agent_maxsockets

I had to regulate the file processing because the file reading is much faster than the REST call so after a while there are too many open REST requests and it clogs the system and it goes out of memory... so now I read 1000 lines wait for the REST calls to finish for those and then resume it (i am doing it using pause and resume methods on stream) Is there a better alternative to this?

Don't use .on('data') on your stream; use .on('readable'). Only read from the stream when you're ready. I also suggest using a transform stream to read by lines.

What all possible optimizations can I perform so that it becomes faster than this. I know the gc related config that prevents from frequent halts in the app.

This is impossible to answer without a detailed analysis of your code. Read more about Node.js and how its internals work. If you spend some time on this, the optimizations that are right for you will become clear.

Is using "cluster" module recommended? Does it work seamlessly?

This is only needed if you are unable to fully utilize your hardware with a single process. It isn't clear what you mean by "seamlessly", but each worker is its own process as far as the OS is concerned, so it isn't something I would call "seamless".

By default, Node uses a socket pool for all HTTP requests, and the default global limit is 5 concurrent connections per host (these are re-used for keep-alive connections, however). There are a few ways around this limit:

  1. Create your own http.Agent and specify it in your http requests:

     var agent = new http.Agent({maxSockets: 1000});
     http.request({
       // ...
       agent: agent
     }, function(res) {
     });
  2. Change the global/default http.Agent limit:

     http.globalAgent.maxSockets = 1000; 
  3. Disable pooling/connection re-use entirely for a request:

     http.request({
       // ...
       agent: false
     }, function(res) {
     });
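Whichever option you pick, you still need to bound how many requests are in flight, which also addresses the pause/resume juggling from the question. Here is one possible sketch of a small promise pool; `runLimited` is a hypothetical helper name, and the worker you pass in would be your real per-line REST call:

```javascript
// Run `worker(item)` over all items, with at most `limit` in flight at once.
// Resolves with results in the original input order.
function runLimited(items, limit, worker) {
  return new Promise(function(resolve, reject) {
    var inFlight = 0;
    var nextIndex = 0;
    var finished = 0;
    var results = new Array(items.length);

    function launchNext() {
      while (inFlight < limit && nextIndex < items.length) {
        (function(i) {
          inFlight++;
          Promise.resolve(worker(items[i]))
            .then(function(result) {
              results[i] = result;
              inFlight--;
              finished++;
              if (finished === items.length) {
                resolve(results);
              } else {
                launchNext(); // a slot freed up; start the next item
              }
            })
            .catch(reject);
        })(nextIndex++);
      }
    }

    if (items.length === 0) {
      resolve(results);
    } else {
      launchNext();
    }
  });
}
```

Usage would look like `runLimited(lines, 100, makeRestCall).then(...)`, where `makeRestCall` is a stand-in for a function that performs one REST request and returns a promise. Tuning the limit against what the Couchbase-backed REST service can actually sustain matters more than the limit on the Node side.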
