
Using Node.js spawn to spawn bash CLIs for several files

Original post: 2016-05-30 22:59:26. Tags: javascript, node.js, multithreading

I am creating a program in Node.js that extracts text from PDFs with the command-line utility pdftotext, calling child_process.spawn once for each file. I would like to know whether this process is CPU heavy and whether thousands of people could use it at the same time without breaking anything.

Is creating a child_process heavy? If pdftotext is not multithreaded, how can I scale? Do I need load balancing?

Thanks.

1 answer

Let's break this down a bit:

I would like to know if this process is CPU heavy

I am not sure how CPU-intensive pdftotext is for a single file; that also depends on how big each file is. Generally speaking, though, extracting text from a PDF involves no asynchronous work and is CPU bound, so I would expect the process to be CPU heavy, especially under a lot of load.

and whether thousands of people could use it at the same time without breaking anything.

Spawning a new process for every single file or on every single request is generally not a good idea. Spawning a process is an expensive operation: each process has its own startup cost and its own memory. Thousands of people using your service at the same time would mean thousands of processes open simultaneously on your server. Memory would run short, and the server would hit a limit and start failing beyond it.
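
One common way to avoid thousands of simultaneous processes is to cap how many run at once and queue the rest. The following is a minimal sketch of that idea, not a definitive implementation. The names runWithLimit and extractText are made up for illustration, and the sketch assumes pdftotext is installed and on the PATH (passing "-" as the output file asks it to write the text to stdout).

```javascript
const { spawn } = require('node:child_process');

// Run `worker(item)` for every item, with at most `limit` calls in flight.
// `limit` is assumed to be at least 1.
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each "lane" keeps claiming the next unprocessed index until none are left.
  async function lane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, lane);
  await Promise.all(lanes);
  return results;
}

// Hypothetical worker: extract the text of one PDF with pdftotext.
function extractText(pdfPath) {
  return new Promise((resolve, reject) => {
    // stdin is unused, stdout is collected, stderr goes to the parent's stderr.
    const child = spawn('pdftotext', [pdfPath, '-'], {
      stdio: ['ignore', 'pipe', 'inherit'],
    });
    let text = '';
    child.stdout.setEncoding('utf8');
    child.stdout.on('data', (chunk) => { text += chunk; });
    child.on('error', reject); // for example, pdftotext is not installed
    child.on('close', (code) => {
      if (code === 0) resolve(text);
      else reject(new Error(`pdftotext exited with code ${code}`));
    });
  });
}
```

A call such as runWithLimit(pdfPaths, require('node:os').cpus().length, extractText) would keep roughly one extraction per CPU core running; the right limit is something to find by measuring.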

Is creating a child_process heavy? If pdftotext is not multithreaded, how can I scale? Do I need load balancing?

As mentioned, spawning a new process is never a cheap operation. It requires memory and other system resources.

Every file will run in a separate process. Whether pdftotext is implemented to use one thread or several within a process is irrelevant here; either way, the process with all its threads will be competing with other processes for machine resources. Of course it helps if the work is divided among threads that can execute in parallel, since that makes each extraction faster. What you should be more concerned about, however, is how long it takes to extract text from a single file, i.e. how long each process spends executing.
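
To find out how long one extraction takes, you can time the child process from Node.js. This is a rough sketch under stated assumptions: timeCommand is a made-up helper, it measures wall-clock time only (not CPU time), and the file names in the usage note are placeholders.

```javascript
const { spawn } = require('node:child_process');

// Resolve with the exit code and wall-clock milliseconds of one command run.
function timeCommand(command, args) {
  return new Promise((resolve, reject) => {
    const start = process.hrtime.bigint();
    // The child's output is discarded; only the duration is of interest.
    const child = spawn(command, args, { stdio: 'ignore' });
    child.on('error', reject); // the command could not be started
    child.on('close', (code) => {
      const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
      resolve({ code, elapsedMs });
    });
  });
}
```

For example, timeCommand('pdftotext', ['sample.pdf', 'sample.txt']) would report how long a single file takes; repeating it over a representative set of PDFs gives a more useful figure than one run.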

If you are going to run this as a service, you need to benchmark and optimize. Depending on the load you want to support and the benchmark results, you will almost certainly have to load balance across a few high-end machines.
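
Once you have a measured time per file, a back-of-the-envelope calculation gives a first guess at how many machines the load needs. The helper below is only an illustration: estimateMachines is a made-up name, and it assumes each extraction keeps exactly one core busy for its whole duration, which real benchmarks may contradict.

```javascript
// Rough estimate of how many machines a CPU-bound workload needs.
// Assumes one extraction occupies one core for `secondsPerFile` seconds.
function estimateMachines(filesPerSecond, secondsPerFile, coresPerMachine) {
  // Average number of cores kept busy by the incoming load.
  const busyCores = filesPerSecond * secondsPerFile;
  return Math.max(1, Math.ceil(busyCores / coresPerMachine));
}

// 40 files/s at 0.5 s each keeps 20 cores busy; with 8 cores per machine
// that rounds up to 3 machines.
console.log(estimateMachines(40, 0.5, 8));
```

This ignores memory, disk I/O, and headroom for traffic spikes, so treat the result as a starting point for load testing rather than a sizing answer.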

I hope I managed to answer some of your questions.