
Multiple Machines — Process Many Files Concurrently?

I need to concurrently process a large number of files (thousands of different files, with an average size of 2MB per file).

All the information is stored on one (1.5TB) network hard drive, and will be accessed (read) by about 30 different machines. For efficiency, each machine will be reading (and processing) different files (there are thousands of files that need to be processed).

Every machine -- after reading a file from the 'incoming' folder on the 1.5TB hard drive -- will process the information and write the output back to the 'processed' folder on the same drive. The processed output for every file is of roughly the same average size as the input files (about 2MB per file).

Are there any dos and don'ts when building such an operation? Is it a problem to have 30 or so machines read (or write) information to the same network drive at the same time? (Note: existing files will only be read, not appended to or overwritten; new files will be created from scratch, so there is no issue of multiple access to the same file.) Are there any bottlenecks that I should expect?

(I am using Linux, Ubuntu 10.04 LTS, on all machines, if it matters at all.)

Things you should think about:

If the processing to be done for each file is simple, then your real bottleneck isn't the number of files you read in parallel, but the throughput of the hard disk drive.

Unless processing takes a long time (say, several seconds per file), you'll reach a point where adding more processes only slows things to a crawl, since every process is reading input and writing results, and the disk can only do so much.
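As a rough back-of-the-envelope illustration (the drive speed here is an assumption, not a figure from the question): each file costs about 2MB of reading plus 2MB of writing, i.e. roughly 4MB of traffic at the drive. If the network drive sustains on the order of 100 MB/s, the whole cluster is capped at roughly 25 files per second, whether 3 machines or 30 are asking for them.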

Try to minimize disk access: for example, copy files to local storage and produce results locally while other processes are downloading, and send the results back when the load on the disk goes down.
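A minimal sketch of that staging pattern in Python; the mount points and the process() function are placeholders, not something given in the question:

    import os
    import shutil
    import tempfile

    NAS_IN = "/mnt/nas/incoming"     # assumed mount point of the shared drive
    NAS_OUT = "/mnt/nas/processed"   # assumed output folder on the same drive

    def process(src, dst):
        # Placeholder for the real per-file work; here it just copies.
        shutil.copy(src, dst)

    def handle(filename):
        # Stage the input onto fast local storage so the network drive
        # is touched only twice per file: one read and one write.
        local_in = os.path.join(tempfile.gettempdir(), filename)
        shutil.copy(os.path.join(NAS_IN, filename), local_in)

        local_out = local_in + ".out"
        process(local_in, local_out)  # all the CPU work happens locally

        # Write under a temporary name, then rename: rename is atomic on
        # the server, so other machines never see a half-written result.
        tmp_path = os.path.join(NAS_OUT, filename + ".tmp")
        shutil.copy(local_out, tmp_path)
        os.rename(tmp_path, os.path.join(NAS_OUT, filename))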

The more I write, the more it boils down to how much processing needs to be done for each file. If it's simple parsing, something that takes milliseconds, then 1 machine or 30 will make little difference.

You need to be careful that two worker processes don't pick up (and try to do) the same piece of work at the same time.

Unfortunately, NFS filesystems don't have semantics that allow you to easily do that.

So what I'd recommend is to use something like Gearman and a producer/consumer model, where one process gives out work to whoever is available to do it.
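A minimal sketch of that producer/consumer shape, assuming the python-gearman package and a job server reachable at jobserver:4730 (the package choice, server address, and mount point are all placeholders for whatever you actually run):

    import os
    import gearman

    # Producer: run once, on any machine; hands out one job per file.
    def enqueue_all():
        client = gearman.GearmanClient(['jobserver:4730'])
        for name in os.listdir('/mnt/nas/incoming'):
            # background=True: fire and forget, don't wait for the result
            client.submit_job('process_file', name, background=True)

    # Consumer: run on each of the ~30 machines. The job server hands each
    # job to exactly one worker, so no file is processed twice.
    def run_worker():
        worker = gearman.GearmanWorker(['jobserver:4730'])

        def process_file(gearman_worker, gearman_job):
            filename = gearman_job.data
            # ... read, process, and write this one file here ...
            return 'done'

        worker.register_task('process_file', process_file)
        worker.work()  # blocks, pulling jobs as they become available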

Another possibility is to have a database (e.g. MySQL) with a table of all tasks, and have the processes atomically "claim" tasks for themselves.
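A minimal sketch of atomic claiming, assuming a tasks table with filename, claimed_by, and done columns, and the MySQLdb driver (the schema and all names are illustrative, not from the answer):

    import socket
    import MySQLdb  # any DB-API driver would do; MySQLdb is an assumption

    WORKER_ID = socket.gethostname()

    def claim_task(db):
        """Atomically claim one unclaimed task; returns a filename or None."""
        cur = db.cursor()
        # The UPDATE is atomic: only one worker can flip claimed_by away
        # from NULL on a given row, so two machines never grab the same task.
        cur.execute(
            "UPDATE tasks SET claimed_by = %s "
            "WHERE claimed_by IS NULL LIMIT 1",
            (WORKER_ID,))
        db.commit()
        if cur.rowcount == 0:
            return None  # queue is empty, nothing left to claim
        cur.execute(
            "SELECT filename FROM tasks "
            "WHERE claimed_by = %s AND done = 0 LIMIT 1",
            (WORKER_ID,))
        row = cur.fetchone()
        return row[0] if row else None

    def mark_done(db, filename):
        cur = db.cursor()
        cur.execute("UPDATE tasks SET done = 1 WHERE filename = %s",
                    (filename,))
        db.commit()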

But all of this is only worthwhile if your processes are mostly CPU-bound. If you're trying to get more IO bandwidth (or operations) out of your NAS by using multiple clients, it's not going to work.

I am assuming that you will be running at least gigabit ethernet here (or it's probably not worth it).

Have you tried running multiple processes on the same machine?
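That is cheap to test before scaling out; a minimal sketch using Python's standard multiprocessing module (the mount point and the per-file work are placeholders):

    import multiprocessing
    import os

    INCOMING = "/mnt/nas/incoming"  # assumed mount point

    def handle(filename):
        # Stand-in for the real read/process/write cycle.
        path = os.path.join(INCOMING, filename)
        return filename, os.path.getsize(path)

    if __name__ == "__main__":
        files = os.listdir(INCOMING)
        # Time this with pool sizes of 1, 2, 4, ...: if throughput stops
        # improving after a few workers, the disk (not CPU) is the limit,
        # and more machines won't help either.
        pool = multiprocessing.Pool(processes=4)
        for name, size in pool.imap_unordered(handle, files):
            print("%s: %d bytes" % (name, size))
        pool.close()
        pool.join()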
