Multiple Machines: Process Many Files Concurrently?

Posted 2010-12-14 00:35:09. Tags: linux, networking, hardware, hard-drive

I need to concurrently process a large number of files (thousands of different files, averaging about 2 MB each).

All the information is stored on one 1.5 TB network hard drive and will be read by about 30 different machines. For efficiency, each machine will read and process different files (there are thousands of files to process).

After reading a file from the 'incoming' folder on the 1.5 TB drive, each machine will process it and write the result back to the 'processed' folder on the same drive. The processed output for each file is roughly the same size as the input (about 2 MB per file).

Are there any dos and don'ts when building such an operation? Is it a problem to have 30 or so machines reading from (or writing to) the same network drive at the same time? (Note: existing files will only be read, never appended to or overwritten; new files will be created from scratch, so there is no concurrent access to the same file.) Are there any bottlenecks I should expect?

(I am using Linux, Ubuntu 10.04 LTS, on all machines, if it matters at all.)

2 Answers

Things you should think about:

If the processing to be done for each file is simple, then your real bottleneck isn't the number of files you read in parallel, but the capabilities of the hard disk drive.

Unless processing takes a long time (say, several seconds per file), you will reach a point where adding more processes only slows things to a crawl, since every process is reading input and writing results, and the disk can only do so much.

Try to minimize disk access: for example, download files and produce results locally while other processes are downloading, and send the results back when the load on the disk goes down.
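A minimal sketch of that local-staging idea, assuming the network drive is mounted as an ordinary directory on each machine. The names `process_bytes` and `process_one` and the `.part` suffix are hypothetical, and the sketch writes results back immediately rather than waiting for the load on the disk to drop:

```python
import shutil
import tempfile
from pathlib import Path


def process_bytes(data: bytes) -> bytes:
    """Hypothetical stand-in for the real per-file processing."""
    return data.upper()


def process_one(src: Path, processed_dir: Path) -> Path:
    """Copy one input file to local disk, process it there, then publish the result."""
    with tempfile.TemporaryDirectory() as scratch:
        local_in = Path(scratch) / src.name
        shutil.copyfile(src, local_in)          # one sequential read from the share
        result = process_bytes(local_in.read_bytes())
        local_out = Path(scratch) / (src.name + ".out")
        local_out.write_bytes(result)           # intermediate I/O stays on local disk

        # Write under a temporary name first, then rename, so anything scanning
        # the 'processed' folder does not pick up a half-written file.
        tmp_dest = processed_dir / (src.name + ".part")
        final_dest = processed_dir / src.name
        shutil.copyfile(local_out, tmp_dest)    # one sequential write to the share
        tmp_dest.replace(final_dest)
    return final_dest
```

Each file then costs the shared drive one full read and one full write, and everything in between happens on the worker's own disk.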

The more I write, the more it boils down to how much processing needs to be done for each file. If it's simple parsing that takes milliseconds, 1 machine or 30 will make little difference.
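To make that trade-off concrete, here is a rough back-of-the-envelope calculation. The 2 MB input and output sizes and the 30 machines come from the question; the 100 MB/s sustained throughput for the shared drive (reads and writes combined) is purely an assumed figure, so substitute a measured value:

```python
SHARED_MB_PER_S = 100.0   # ASSUMPTION: sustained drive throughput, reads + writes combined
INPUT_MB = 2.0            # from the question: about 2 MB read per file
OUTPUT_MB = 2.0           # from the question: about 2 MB written per file
MACHINES = 30             # from the question


def max_files_per_second(shared_mb_per_s: float, in_mb: float, out_mb: float) -> float:
    """Upper bound on files/s the shared drive can serve, across all machines."""
    return shared_mb_per_s / (in_mb + out_mb)


def break_even_seconds(machines: int, files_per_s: float) -> float:
    """Per-file processing time below which the drive, not the CPUs, is the limit."""
    return machines / files_per_s


ceiling = max_files_per_second(SHARED_MB_PER_S, INPUT_MB, OUTPUT_MB)
print(ceiling)                                  # 25.0 files per second in total
print(break_even_seconds(MACHINES, ceiling))    # 1.2 seconds per file
```

Under these assumptions, if each file takes much less than about 1.2 seconds to process, 30 machines would already saturate the drive and adding more would not help; if each file takes many seconds, the drive has headroom.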

You need to be careful that two worker processes don't pick up (and try to do) the same piece of work at the same time.

Unfortunately, NFS filesystems don't have semantics that allow you to do that easily.

So what I'd recommend is to use something like Gearman and a producer/consumer model, where one process hands out work to whoever is available to do it.
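The producer/consumer shape can be illustrated on a single machine with Python's standard `queue` and `threading` modules. This is not Gearman's API, only an in-process stand-in for the same pattern; `process_file` and `run` are hypothetical names:

```python
import queue
import threading


def process_file(name: str) -> str:
    """Hypothetical stand-in for the real per-file work."""
    return name + ".done"


def run(filenames, n_workers=4):
    """One producer hands filenames to whichever worker is free next."""
    tasks = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            name = tasks.get()
            if name is None:        # sentinel: no more work for this worker
                break
            out = process_file(name)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for name in filenames:          # producer: each item is handed out exactly once
        tasks.put(name)
    for _ in threads:               # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    return results
```

Because every filename is taken from the queue by exactly one worker, no two workers can end up processing the same file. A job server plays the role of the queue when the workers live on different machines.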

Another possibility is to have a database (e.g. MySQL) with a table of all tasks, and have the processes atomically "claim" tasks for themselves.

But all of this is only worthwhile if your processes are mostly CPU-bound. If you're trying to get more I/O bandwidth (or operations per second) out of your NAS by using multiple clients, it's not going to work.

I am assuming that you will be running at least gigabit Ethernet here (otherwise it's probably not worth it).

Have you tried running multiple processes on the same machine?