C#: poor performance with multithreading with heavy I/O

I've written an application in C# that moves jpgs from one set of directories to another set of directories concurrently (one thread per fixed subdirectory). The code looks something like this:

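        // Requires: using System.IO;
        string destination = "";
        DirectoryInfo dir = new DirectoryInfo("");
        // GetDirectories() returns an array of subdirectories.
        DirectoryInfo[] subDirs = dir.GetDirectories();
        foreach (DirectoryInfo d in subDirs)
        {
            // List the files of the current subdirectory.
            FileInfo[] files = d.GetFiles();
            foreach (FileInfo f in files)
            {
                // MoveTo expects the full destination file path, not just a directory.
                f.MoveTo(Path.Combine(destination, f.Name));
            }
        }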

However, the performance of the application is horrendous: tons of page faults/sec. The number of files in each subdirectory can get quite large, so I think a big performance penalty comes from context switching: the process can't keep all the different file arrays in RAM at the same time, so it ends up going to disk nearly every time.

There are two different solutions that I can think of. The first is rewriting this in C or C++, and the second is to use multiple processes instead of multithreading.

Edit: The files are named based on a time stamp, and the directory each file is moved to is based on that name. So the destination directory corresponds to the hour the file was created; 3-27-2009/10 for instance.

We are creating a background worker per directory for threading.

Any suggestions?

Rule of thumb: don't parallelize operations with serial dependencies. In this case your hard drive is the bottleneck, and too many threads are just going to make performance worse.

If you are going to use threads, try to limit their number to the resources you have available (cores and hard disks), not to the number of jobs you have pending (directories to copy).
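
As a minimal sketch of capping the worker count: the names `BoundedMover` and `MoveAll` are made up for illustration and are not from the question, and using `Parallel.ForEach` instead of one background worker per directory is an assumption. `ParallelOptions.MaxDegreeOfParallelism` limits how many subdirectories are processed at once, however many are pending.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; the names are illustrative, not from the original question.
public static class BoundedMover
{
    // Moves every file found directly under each subdirectory of sourceRoot into
    // destination, working on at most maxWorkers subdirectories at a time.
    // Returns the number of files moved.
    public static int MoveAll(string sourceRoot, string destination, int maxWorkers)
    {
        Directory.CreateDirectory(destination);
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxWorkers };
        int moved = 0;

        Parallel.ForEach(Directory.GetDirectories(sourceRoot), options, subDir =>
        {
            foreach (string file in Directory.GetFiles(subDir))
            {
                // Assumption: file names are unique across subdirectories.
                // File.Move throws IOException if the target already exists.
                File.Move(file, Path.Combine(destination, Path.GetFileName(file)));
                Interlocked.Increment(ref moved);
            }
        });
        return moved;
    }
}
```

Calling `BoundedMover.MoveAll(source, destination, 1)` gives the single-threaded baseline; raising the last argument one step at a time should show whether extra workers help at all on a given disk.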

Reconsidered answer

I've been rethinking my original answer below. I still suspect that using fewer threads would probably be a good idea, but as you're just moving files, it shouldn't actually be that IO intensive. It's possible that just listing the files is taking a lot of disk work.

However, I doubt that you're really running out of memory for the files. How much memory have you got? How much memory is the process taking up? How many threads are you using, and how many cores do you have? (Using significantly more threads than you have cores is a bad idea, IMO.)

I suggest the following plan of attack:

  • Work out where the bottlenecks actually are. Try fetching the list of files but not moving them. See how hard the disk is hit, and how long it takes.
  • Experiment with different numbers of threads, with a queue of directories still to process.
  • Keep an eye on the memory use and garbage collections. The Windows performance counters for the CLR are good for this.
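
The first step above could be sketched roughly like this. `ListingProbe` and `CountFiles` are hypothetical names; the code only lists files and times the listing with `Stopwatch`, without moving anything.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical probe; the names are illustrative, not from the original answer.
public static class ListingProbe
{
    // Lists the files in every subdirectory of root without moving anything.
    // Returns the total number of files seen; elapsed receives the wall-clock time.
    public static int CountFiles(string root, out TimeSpan elapsed)
    {
        Stopwatch timer = Stopwatch.StartNew();
        int total = 0;
        foreach (string subDir in Directory.GetDirectories(root))
        {
            total += Directory.GetFiles(subDir).Length;
        }
        timer.Stop();
        elapsed = timer.Elapsed;
        return total;
    }
}
```

Comparing this time with a full run that also moves the files gives a rough idea of whether listing or moving dominates. A second run over the same directories may be faster simply because the OS has cached the directory data, so the first measurement is probably the more representative one.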

Original answer

Rewriting in C or C++ wouldn't help. Using multiple processes wouldn't help. What you're doing is akin to giving a single processor a hundred threads - except you're doing it with the disk instead.

It makes sense to parallelise tasks which use IO if there's also a fair amount of computation involved, but if it's already disk bound, asking the disk to work with lots of files at the same time is only going to make things worse.

You may be interested in a benchmark (description and initial results) I've recently been running, testing "encryption" of individual lines of a file. When the level of "encryption" is low (i.e. it's hardly doing any CPU work), the best results are always with a single thread.

If you've got a block of work that is dependent on a system bottleneck, in this case disk IO, you would be better off not using multiple threads or processes. All that you will end up doing is generating a lot of extra CPU and memory activity while waiting for the disk. You would probably find the performance of your app improved if you used a single thread to do your moves.

It seems you are moving a directory. Surely just renaming/moving the directory would be sufficient? If the source and destination are on the same hard disk, it would be instant.

Also, capturing all the file info for every file is unnecessary; the name of the file would suffice.
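
A rough sketch of that idea, assuming a whole subdirectory really does map onto a single destination (which the edit in the question suggests may not be the case). `DirectoryRename` and `MoveWhole` are hypothetical names; `Directory.Move` is the framework call doing the work.

```csharp
using System.IO;

// Hypothetical wrapper; the names are illustrative, not from the original answer.
public static class DirectoryRename
{
    // Moves a directory and everything in it with a single call.
    // Directory.Move is documented to throw IOException if destDir already
    // exists or if the move would cross to a different volume.
    public static void MoveWhole(string sourceDir, string destDir)
    {
        Directory.Move(sourceDir, destDir);
    }
}
```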

The performance problem comes from the hard drive; there's no need to redo everything in C/C++ or with multiple processes.

Are you looking at the page-fault count and inferring memory pressure from that? You might well find that the underlying Win32/OS file copy is using mapped files/page faults to do its work, and the faults are not a sign of a problem anyway. Much of Windows' own file handling is done via page faults (e.g. 'loading' executable code); they're not a bad thing per se.

If you are suffering from memory pressure, then I would surmise that it's more likely to be caused by creating a huge number of threads (which are very expensive), rather than by the file copying.

Don't change anything without profiling, and if you profile and find the time is spent in framework methods which are merely wrappers on Win32 functions (download the framework source and have a look at how those methods work), then don't waste time on C++.

If GetFiles() really is returning a large amount of data, you could write an enumerator, such as:

IEnumerable<string> GetFiles();
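
One possible way to fill in that signature, as a sketch: `LazyListing` is a hypothetical class name, and the `directory` parameter is added for illustration. It leans on `Directory.EnumerateFiles`, which exists in .NET Framework 4 and later and returns paths lazily rather than building the whole array first.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper; the class name and the parameter are illustrative.
public static class LazyListing
{
    // Yields full file paths one at a time instead of materializing an array.
    public static IEnumerable<string> GetFiles(string directory)
    {
        // Directory.EnumerateFiles is itself lazy; the loop just re-exposes it
        // under the signature suggested above.
        foreach (string path in Directory.EnumerateFiles(directory))
        {
            yield return path;
        }
    }
}
```

Only file names (paths) are produced here, with no `FileInfo` objects, so per-file memory stays small even for very large directories.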

So, you're moving files, one at a time, from one subfolder to another subfolder? Wouldn't you be causing lots of disk seeks as the drive head moves back and forth? You might get better performance from reading the files into memory (at least in batches if not all at once), writing them to disk, then deleting the originals from disk.

And if you're doing multiple sets of folders in separate threads, then you're moving the disk head around even more. This is one case where multiple threads aren't doing you any favors (although you might get some benefit if you have a RAID or SAN, etc.).

If you were processing the files in some way, then multithreading could help if different CPUs could calculate on multiple files at once. But you can't get four CPUs to move one disk head to four different locations at once.
