
Searching Subdirectories in C#

I have a list of file names, and I want to search a directory and all its subdirectories. These directories contain about 200,000 files each. My code finds the file, but it takes about 20 minutes per file. Can someone suggest a better method?

Code Snippet

String[] file_names = File.ReadAllLines(@"C:\file.txt");
foreach(string file_name in file_names) 
{
    string[] files = Directory.GetFiles(@"I:\pax\", file_name + ".txt",
                                        SearchOption.AllDirectories);
    foreach(string file in files)
    {
        System.IO.File.Copy(file, 
                            @"C:\" + 
                            textBox1.Text + @"\N\O\" + 
                            file_name + 
                            ".txt"
                            );
    }
}

If you're searching for multiple files in the same directory structure, you should find all the files in that directory structure once, and then search through them in memory. There's no need to go to the file system again and again.

EDIT: There's an elegant way of doing this, with LINQ - and the less elegant way, without. Here's the LINQ way:

using System;
using System.IO;
using System.Linq;

class Test
{
    static void Main()
    {
        // This creates a lookup from filename to the set of 
        // directories containing that file
        var textFiles = 
            Directory.GetFiles("I:\\pax", "*.txt", SearchOption.AllDirectories)
                     .ToLookup(file => Path.GetFileName(file),
                               file => Path.GetDirectoryName(file));

        string[] fileNames = File.ReadAllLines(@"c:\file.txt");
        // Remove the quotes for your real code :)
        string targetDirectory = "C:\\" + "textBox1.Text" + "\\N\\O\\";

        foreach (string fileName in fileNames)
        {
            string tmp = fileName + ".txt";
            foreach (string directory in textFiles[tmp])
            {
                string source = Path.Combine(directory, tmp);
                string target = Path.Combine(targetDirectory, tmp);
                File.Copy(source, target);                                       
            }
        }
    }
}

Let me know if you need the non-LINQ way. One thing to check before I do so though - this could copy multiple files over the top of each other. Is that really what you want to do? (Imagine that a.txt exists in multiple places, and "a" is in the file.)
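For reference, here's a minimal sketch of what the non-LINQ way might look like, assuming the same paths and placeholder target directory as above; it builds the same filename-to-directories lookup by hand with a Dictionary<string, List<string>>:

using System;
using System.Collections.Generic;
using System.IO;

class Test
{
    static void Main()
    {
        // Build the filename -> directories lookup by hand.
        var textFiles = new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
        foreach (string file in Directory.GetFiles(@"I:\pax", "*.txt", SearchOption.AllDirectories))
        {
            string name = Path.GetFileName(file);
            if (!textFiles.TryGetValue(name, out List<string> dirs))
            {
                dirs = new List<string>();
                textFiles[name] = dirs;
            }
            dirs.Add(Path.GetDirectoryName(file));
        }

        string[] fileNames = File.ReadAllLines(@"c:\file.txt");
        string targetDirectory = @"C:\target"; // placeholder, as in the LINQ version

        foreach (string fileName in fileNames)
        {
            string tmp = fileName + ".txt";
            if (!textFiles.TryGetValue(tmp, out List<string> directories))
                continue;
            foreach (string directory in directories)
            {
                File.Copy(Path.Combine(directory, tmp),
                          Path.Combine(targetDirectory, tmp));
            }
        }
    }
}

The behaviour is the same: one walk of the tree up front, then every name in the list is answered from memory.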

You're probably better off trying to load all the file paths into memory. Call Directory.GetFiles() once, and put the results into a HashSet<string>. Then do lookups on the HashSet. This will work fine if you have enough memory, and it's easy to try.

If you run out of memory, you'll have to be smarter, for example by using a buffer cache. The easiest way to do that is to load all the file paths as rows into a database table and let the query processor manage the buffer cache for you.

Here's code for the first:

// Needs: using System.Collections.Generic; using System.IO; using System.Linq;
String[] file_names = File.ReadAllLines(@"C:\file.txt");
HashSet<string> allFiles = new HashSet<string>();
string[] files = Directory.GetFiles(@"I:\pax\", "*.txt", SearchOption.AllDirectories);
foreach (string file in files)
{
    allFiles.Add(file);
}

foreach (string file_name in file_names)
{
    // allFiles holds full paths, so compare against the file name part only.
    String file = allFiles.FirstOrDefault(f => Path.GetFileName(f) == file_name + ".txt");
    if (file != null)
    {
        System.IO.File.Copy(file, @"C:\" + textBox1.Text + @"\N\O\" + file_name + ".txt");
    }
}

You could be even smarter about memory usage by traversing the directories one at a time and adding each resulting file array to the hash set. That way all the file names wouldn't have to exist in one big String[] at once.
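If memory does become the bottleneck, here is a minimal sketch of the database idea, assuming the Microsoft.Data.Sqlite package (any embedded database would do; the database file, table, and lookup names here are made up for illustration):

using System.IO;
using Microsoft.Data.Sqlite; // assumed NuGet package

class IndexInDatabase
{
    static void Main()
    {
        using var connection = new SqliteConnection("Data Source=files.db");
        connection.Open();

        var setup = connection.CreateCommand();
        setup.CommandText =
            "CREATE TABLE IF NOT EXISTS files (name TEXT, dir TEXT);" +
            "CREATE INDEX IF NOT EXISTS idx_files_name ON files(name);";
        setup.ExecuteNonQuery();

        // Load every path once; the index makes later lookups cheap,
        // and SQLite's page cache decides what stays in memory.
        using (var tx = connection.BeginTransaction())
        {
            var insert = connection.CreateCommand();
            insert.Transaction = tx;
            insert.CommandText = "INSERT INTO files (name, dir) VALUES ($name, $dir)";
            var pName = insert.Parameters.Add("$name", SqliteType.Text);
            var pDir = insert.Parameters.Add("$dir", SqliteType.Text);
            foreach (string file in Directory.EnumerateFiles(@"I:\pax", "*.txt", SearchOption.AllDirectories))
            {
                pName.Value = Path.GetFileName(file);
                pDir.Value = Path.GetDirectoryName(file);
                insert.ExecuteNonQuery();
            }
            tx.Commit();
        }

        // Example lookup for one name from the list:
        var query = connection.CreateCommand();
        query.CommandText = "SELECT dir FROM files WHERE name = $name";
        query.Parameters.AddWithValue("$name", "a.txt");
        using var reader = query.ExecuteReader();
        while (reader.Read())
        {
            string dir = reader.GetString(0);
            // copy Path.Combine(dir, "a.txt") to the target here
        }
    }
}

That way the query processor, not your code, decides which pages stay in memory, so the working set stays bounded however many paths you load.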

You are performing a recursive GetFiles() over and over again; that is probably the most expensive part.

Try loading all the files into memory and doing your own matching on that.

Note that it will be more efficient to load one folder at a time, search it for every file_name in file_names, and then repeat for the next folder.
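Here's a minimal sketch of that folder-by-folder approach (the paths are assumptions, matching the question):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class FolderByFolder
{
    static void Main()
    {
        // The names we're looking for, kept in a set for O(1) membership tests.
        var wanted = new HashSet<string>(
            File.ReadAllLines(@"C:\file.txt").Select(n => n + ".txt"),
            StringComparer.OrdinalIgnoreCase);

        // Walk the tree one directory at a time; only the current folder's
        // file list is ever held in memory.
        var pending = new Stack<string>();
        pending.Push(@"I:\pax");
        while (pending.Count > 0)
        {
            string dir = pending.Pop();
            foreach (string sub in Directory.GetDirectories(dir))
                pending.Push(sub);

            foreach (string file in Directory.GetFiles(dir, "*.txt"))
            {
                if (wanted.Contains(Path.GetFileName(file)))
                    Console.WriteLine("match: " + file); // copy it here
            }
        }
    }
}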

Scanning a directory structure is an I/O-intensive operation. Whatever you do, the first GetFiles() call will take the majority of the time; by the end of it, most of the file information will be in the file system cache, so a second call will return in no time compared to the first (depending on your free memory and file system cache size).

Probably your best option is to turn on indexing on the file system and use it; see Querying the Index Programmatically.

At a glance it appears that there are .NET APIs to call the Windows Indexing service... provided the machine you're using has indexing enabled (and I'm also unsure if the aforementioned service refers to the XP-era Indexing Service or the Windows Search indexing service).

Google Search

One possible lead

Another
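For what it's worth, here's a hedged sketch of querying the Windows Search index through its OLE DB provider, which is one way "Querying the Index Programmatically" can look from .NET. It only works on Windows, only where indexing covers the folder, and the scope path is an assumption:

using System;
using System.Data.OleDb; // Windows-only

class QueryIndex
{
    static void Main()
    {
        // The Windows Search indexer exposes a read-only OLE DB provider.
        using var connection = new OleDbConnection(
            "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';");
        connection.Open();

        // Windows Search SQL dialect; SCOPE restricts the query to one tree.
        var command = new OleDbCommand(
            "SELECT System.ItemPathDisplay FROM SystemIndex " +
            "WHERE System.FileName = 'a.txt' AND SCOPE = 'file:I:/pax'",
            connection);

        using var reader = command.ExecuteReader();
        while (reader.Read())
            Console.WriteLine(reader.GetString(0));
    }
}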

Try using LINQ to query the filesystem. I'm not 100% sure of the performance, but it's really easy to test.

var filesResult = from file in new DirectoryInfo(path).GetFiles("*.txt", SearchOption.AllDirectories)
                  where file.Name == filename
                  select file;

Then just do whatever you want with the result.

The LINQ answer may run into problems because it loads all the file names into memory before it starts selecting from them. Generally, you might want to load the contents of a single directory at a time to reduce memory pressure.

However, for a problem like this, you might want to go up one level in the problem formulation. If this is a query you run often, you could build something that uses a FileSystemWatcher to listen for changes in the top directory and all directories below it. Prime it on start-up by walking all the directories and building the results into a Dictionary<> or HashSet<>. (Yes, this has the same memory problem as the LINQ solution.) Then, when you get file add/delete/rename notifications, update the dictionary. That way, each individual query can be answered very quickly.

If the queries come from a tool that's invoked a lot, you probably want to build the FileSystemWatcher into a service, and connect to / query that service from the actual tool that needs to know, so that the file system information is built up once and re-used for the lifetime of the service process.
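Here's a rough sketch of such a watcher-backed index, with locking and error handling kept to a minimum; a real service would also need to handle watcher buffer overruns and missed events:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class FileIndex
{
    // filename -> full paths; ConcurrentDictionary because watcher events
    // arrive on thread-pool threads.
    readonly ConcurrentDictionary<string, HashSet<string>> index =
        new ConcurrentDictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);
    readonly FileSystemWatcher watcher;

    public FileIndex(string root)
    {
        // Prime the index with one full walk on start-up.
        foreach (string file in Directory.EnumerateFiles(root, "*.txt", SearchOption.AllDirectories))
            Add(file);

        watcher = new FileSystemWatcher(root, "*.txt") { IncludeSubdirectories = true };
        watcher.Created += (s, e) => Add(e.FullPath);
        watcher.Deleted += (s, e) => Remove(e.FullPath);
        watcher.Renamed += (s, e) => { Remove(e.OldFullPath); Add(e.FullPath); };
        watcher.EnableRaisingEvents = true;
    }

    void Add(string path)
    {
        var set = index.GetOrAdd(Path.GetFileName(path), _ => new HashSet<string>());
        lock (set) set.Add(path);
    }

    void Remove(string path)
    {
        if (index.TryGetValue(Path.GetFileName(path), out var set))
            lock (set) set.Remove(path);
    }

    public IEnumerable<string> Find(string fileName) =>
        index.TryGetValue(fileName, out var set) ? set : Enumerable.Empty<string>();
}

With that in place, new FileIndex(@"I:\pax").Find("a.txt") answers each query from memory without touching the disk again.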
