
Efficiently searching a massive file for a string in C#

I am building an app that scans files by comparing hashes. I need to search over 1GB of hashes for the hash of a file. I found other solutions for this, such as Aho-Corasick, but it was slower than File.ReadLines(file).Contains(str).

This is the code that is the fastest so far, using File.ReadLines. It takes about 8 seconds to scan one file, versus around 2 minutes to scan one file using Aho-Corasick. I cannot read the entire hash file into memory for obvious reasons.

IEnumerable<DirectoryInfo> directories = new DirectoryInfo(scanPath).EnumerateDirectories();
IEnumerable<FileInfo> files = new DirectoryInfo(scanPath).EnumerateFiles();

FileInfo hashes = new FileInfo(hashPath);
await Task.Run(() =>
{
    IEnumerable<string> lines = File.ReadLines(hashes.FullName);
    
    foreach (FileInfo file in files) {
        if (!AuthenticodeTools.IsTrusted(file.FullName))
        {
            string hash = getHash(file.FullName);
            if (lines.Contains(hash)) flaggedFiles.Add(file.FullName);
        }
        filesScanned += 1;
    }
});
foreach (DirectoryInfo directory in directories)
{
    await scan(directory.FullName, hashPath);
    directoriesScanned += 1;
}

Edit: Per request, here are examples of the file's content:

5c269c9ec0255bbd9f4e20420233b1a7
63510b1eea36a23b3520e2b39c35ef4e
0955924ebc1876f0b849b3b9e45ed49d

They are MD5 hashes.
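(getHash isn't shown in the question; presumably it computes a lower-case MD5 hex digest of each file, roughly like the sketch below. The exact implementation is an assumption, shown only to make the later examples self-contained.)

// Assumed shape of getHash: lower-case MD5 hex digest of a file's contents.
// Requires: using System; using System.IO; using System.Security.Cryptography;
static string getHash(string path)
{
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(path))
    {
        byte[] digest = md5.ComputeHash(stream);
        // Lower-case hex with no separators, matching the hash file's format.
        return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
    }
}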

As the hashes are fixed at 32 hex digits (16 bytes each), they should be stored in binary, packed with no separators. That way we can seek straight to any hash with a simple multiplication.
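A one-time conversion from the text file to a packed binary file might look roughly like this sketch (the file-name parameters are placeholders, and the input is assumed to contain one 32-character hex hash per line):

// One-time conversion: hex-text hash file -> packed 16-byte binary records.
// Requires: using System; using System.IO; using System.Linq;
static void ConvertHashesToBinary(string textPath, string binaryPath)
{
    using (var output = File.Create(binaryPath))
    {
        foreach (string line in File.ReadLines(textPath))
        {
            string hex = line.Trim();
            if (hex.Length != 32) continue; // skip blank or malformed lines

            // Convert 32 hex characters into 16 raw bytes.
            byte[] hash = Enumerable.Range(0, 16)
                                    .Select(i => Convert.ToByte(hex.Substring(i * 2, 2), 16))
                                    .ToArray();
            output.Write(hash, 0, hash.Length);
        }
    }
}

The nth hash then starts at byte offset n * 16, which is what makes the direct seek (and the binary search below) possible.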

If we then sort the hashes in the file, we can speed this up further by doing a binary search for each hash.

Ordering can be done using the CompareHashes function below as a compare function.
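A one-time sort could look something like the sketch below. It loads all of the 16-byte records into memory at once (roughly 500 MB for a 1 GB text file), which may or may not be acceptable; if it isn't, an external merge sort would be needed.

// One-time sort of the packed binary hash file so that binary search works.
// Requires: using System; using System.IO;
static void SortHashFile(string binaryPath)
{
    byte[] all = File.ReadAllBytes(binaryPath);
    int count = all.Length / 16;

    // Split the flat byte array into individual 16-byte hashes.
    var records = new byte[count][];
    for (int i = 0; i < count; i++)
    {
        records[i] = new byte[16];
        Buffer.BlockCopy(all, i * 16, records[i], 0, 16);
    }

    // Sort using CompareHashes (defined below) as the comparison.
    Array.Sort(records, CompareHashes);

    // Write the sorted hashes back out over the original file.
    using (var output = File.Create(binaryPath))
    {
        foreach (var record in records)
            output.Write(record, 0, record.Length);
    }
}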


Once we have done that we can do a binary search.

Binary search is a simple algorithm that searches through a sorted list. It has O(log₂ n) complexity, so, for the number of hashes you have, it would only require at most around 25 lookups. The algorithm is as follows:

  1. Look at the item in the middle of the current range.
  2. If it is the item we're looking for, we're done.
  3. If the item we're looking for is earlier, move the high point to just before the middle and repeat from step 1 on the lower half.
  4. If it is later, move the low point to just after the middle and repeat from step 1 on the upper half.
  5. If the low point passes the high point, the item isn't in the list.

(I've lifted and modified some of the code from ArraySortHelper in the .NET Framework for this.)

public static bool ContainsHash(FileStream hashFile, byte[] hash)
{
    const int hashSize = 16;
    var curHash = new byte[hashSize];
    long lo = 0;
    long hi = hashFile.Length / hashSize - 1;
    while (lo <= hi)
    {
        long i = lo + ((hi - lo) >> 1);

        // Seek to the i-th hash and read its 16 bytes into curHash.
        hashFile.Seek(i * hashSize, SeekOrigin.Begin);
        hashFile.Read(curHash, 0, hashSize);

        int order = CompareHashes(curHash, hash);

        if (order == 0) return true;
        if (order < 0)
        {
            lo = i + 1; // stored hash is smaller: search the upper half
        }
        else
        {
            hi = i - 1; // stored hash is larger: search the lower half
        }
    }
    return false;
}

public static int CompareHashes(byte[] b1, byte[] b2)
{
    var comp = 0;
    for (int i = 0; i < b1.Length; i++)
    {
        comp = b1[i].CompareTo(b2[i]);
        if(comp != 0) return comp;
    }
    return comp;
}

We only need to open the file of hashes once, and pass the function the FileStream for the hashes plus a hash to compare.
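Applied to the scanning loop from the question, usage might look roughly like this (a sketch; "hashes.bin" is a placeholder for the sorted binary file, and getHash is assumed to return a 32-character hex string that needs converting to raw bytes first):

// Open the sorted binary hash file once and probe it for each untrusted file.
// Requires: using System; using System.IO; using System.Linq;
using (FileStream hashFile = File.OpenRead("hashes.bin"))
{
    foreach (FileInfo file in files)
    {
        if (AuthenticodeTools.IsTrusted(file.FullName)) continue;

        // Convert the 32-character hex digest into 16 raw bytes.
        string hex = getHash(file.FullName);
        byte[] hash = Enumerable.Range(0, 16)
                                .Select(i => Convert.ToByte(hex.Substring(i * 2, 2), 16))
                                .ToArray();

        if (ContainsHash(hashFile, hash)) flaggedFiles.Add(file.FullName);
    }
}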


I may have some slight errors as I have not tested it. I would love for others to please test and edit this answer.

It seems like you will process all of the files in the directory, so why not change your approach? First, fill a dictionary of all of the files that aren't trusted, with something like:

var hashDict = files.Where(fi => !IsTrusted(fi.FullName))
                    .ToDictionary(fi => fi.FullName, fi => getHash(fi.FullName));

Now that you have a list of the hashes to check, pass them into a method that gets the flagged files.

using (var stream = File.OpenRead(hashPath))
{
    var flaggedFiles = GetFilesWithMatchingHashes(stream, hashDict);
    // Do whatever you need to do with the list.
}

Here is the search method:

private static List<string> GetFilesWithMatchingHashes(Stream s, Dictionary<string,string> hashes)
{
    var results = new List<string>();
    var bufsize = (1024 * 1024 / 34)*34; // Each line should be 32 characters for the hash and 2 for cr-lf
                                         // Adjust if this isn't the case
    var buffer = new byte[bufsize];
    s.Seek(0, SeekOrigin.Begin);

    var readcount = bufsize;
    var keyList = hashes.Keys.ToList();
    while (keyList.Count > 0 && (readcount = s.Read(buffer, 0, bufsize)) > 0)
    {
        var str = Encoding.ASCII.GetString(buffer, 0, readcount);
        for (var i = keyList.Count - 1; i >= 0; i--)
        {
            var k = keyList[i];
            if (str.Contains(hashes[k]))
            {
                results.Add(k);
                keyList.RemoveAt(i);
            }
        }
    }
    return results; // This should contain a list of the files with found hashes.
}

The benefit of this solution is that you will only scan through the file once. I did some testing where I searched for the last hash in a file of 1,020,000,000 bytes. Just searching for one hash was more than twice as fast as your ReadLines method. Getting them all at once should be much faster.
