Processing Huge Files In C#

I have a 4 GB file on which I want to perform a byte-based find and replace. I have written a simple program to do it, but it takes far too long (90+ minutes) to do just one find and replace. A few hex editors I have tried can perform the task in under 3 minutes without loading the entire target file into memory. Does anyone know a method by which I can accomplish the same thing? Here is my current code:

    public int ReplaceBytes(string File, byte[] Find, byte[] Replace)
    {
        var Stream = new FileStream(File, FileMode.Open, FileAccess.ReadWrite);
        int FindPoint = 0;
        int Results = 0;
        for (long i = 0; i < Stream.Length; i++)
        {
            if (Find[FindPoint] == Stream.ReadByte())
            {
                FindPoint++;
                if (FindPoint > Find.Length - 1)
                {
                    Results++;
                    FindPoint = 0;
                    Stream.Seek(-Find.Length, SeekOrigin.Current);
                    Stream.Write(Replace, 0, Replace.Length);
                }
            }
            else
            {
                FindPoint = 0;
            }
        }
        Stream.Close();
        return Results;
    }

Find and Replace are small compared with the 4 GB file, by the way. I can easily see why my algorithm is slow, but I am not sure how to do it better.

Part of the problem may be that you're reading the stream one byte at a time. Try reading larger chunks and doing the replace on those. I'd start with about 8 KB and then test with larger and smaller chunks to see what gives you the best performance.

There are lots of better algorithms for finding a substring in a string (which is basically what you are doing).

Start here:

http://en.wikipedia.org/wiki/String_searching_algorithm

The gist of them is that you can skip a lot of bytes by analyzing your substring first. Here's a simple example:

4GB File starts with: ABCDEFGHIJKLMNOP

Your substring is: NOP

  1. You skip the length of the substring-1 and check against the last byte, so compare C to P
  2. It doesn't match, so the substring is not the first 3 bytes
  3. Also, C isn't in the substring at all, so you can skip 3 more bytes (len of substring)
  4. Compare F to P, doesn't match, F isn't in substring, skip 3
  5. Compare I to P, etc, etc

If you match, go backwards. If the character doesn't match, but is in the substring, then you have to do some more comparing at that point (read the link for details)
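The skipping idea above can be sketched with the Boyer-Moore-Horspool variant, which is one of the algorithms on the linked page and not necessarily the one this answer had in mind. The names `HorspoolSketch` and `IndexOf` are made up for illustration, and the sketch searches an in-memory byte array rather than a file:

```csharp
using System;

static class HorspoolSketch
{
    // Returns the index of the first occurrence of pattern in data[0..dataLength)
    // at or after start, or -1 if there is none.
    public static int IndexOf(byte[] data, int dataLength, byte[] pattern, int start = 0)
    {
        int m = pattern.Length;
        if (m == 0 || dataLength - start < m) return -1;

        // shift[b] = how far the window may slide when the byte under its last
        // position is b. Bytes that never occur in the pattern allow a full skip of m.
        var shift = new int[256];
        for (int i = 0; i < 256; i++) shift[i] = m;
        for (int i = 0; i < m - 1; i++) shift[pattern[i]] = m - 1 - i;

        int pos = start;
        while (pos <= dataLength - m)
        {
            // Compare backwards from the last byte of the window.
            int j = m - 1;
            while (j >= 0 && data[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;

            pos += shift[data[pos + m - 1]];
        }
        return -1;
    }
}
```

For repeated searches with the same pattern, the shift table would normally be built once and reused; it is rebuilt on every call here only to keep the sketch short.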

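Instead of reading the file byte by byte, read it into a buffer:

    var buffer = new byte[bufferSize];
    long currentPos = 0;             // file offset of the start of the current chunk
    long length = Stream.Length;     // keep this as long: a 4 GB length overflows int
    int count;
    // The second argument of Read is an offset into buffer, not into the file,
    // so it stays 0 for every chunk.
    while ((count = Stream.Read(buffer, 0, bufferSize)) > 0)
    {
        // .... search buffer[0..count) here
        currentPos += count;
    }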

Another, easier way of reading more than one byte at a time:

    var Stream = new BufferedStream(new FileStream(File, FileMode.Open, FileAccess.ReadWrite));

Combining this with Saeed Amiri's example of how to read into a buffer, and one of the better binary find/replace algorithms should give you better results.
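Putting the chunked read together with an in-place replace might look roughly like the sketch below. It assumes that Find and Replace have the same length (overwriting in place cannot grow or shrink the file), it uses a plain byte-by-byte comparison where a faster search could be substituted, and the names `ChunkedReplaceSketch` and `chunkSize` are hypothetical. The last `find.Length - 1` bytes of each chunk are carried over to the next one so that a match straddling a chunk boundary is not missed:

```csharp
using System;
using System.IO;

static class ChunkedReplaceSketch
{
    public static int ReplaceBytes(Stream stream, byte[] find, byte[] replace, int chunkSize = 1 << 20)
    {
        if (find.Length == 0 || find.Length != replace.Length)
            throw new ArgumentException("find and replace must be non-empty and of equal length");

        int overlap = find.Length - 1;
        var buffer = new byte[chunkSize + overlap];
        int carried = 0;        // bytes kept from the previous chunk at the start of buffer
        long bufferStart = 0;   // file offset of buffer[0]
        int searchStart = 0;    // first buffer index that may start a match
        int results = 0;
        int read;

        stream.Position = 0;
        while ((read = stream.Read(buffer, carried, chunkSize)) > 0)
        {
            int valid = carried + read;
            long resume = stream.Position;   // where the next read continues

            int i = searchStart;
            while (i <= valid - find.Length)
            {
                int j = 0;
                while (j < find.Length && buffer[i + j] == find[j]) j++;
                if (j == find.Length)
                {
                    // Overwrite the match in the file, then skip past it.
                    stream.Position = bufferStart + i;
                    stream.Write(replace, 0, replace.Length);
                    results++;
                    i += find.Length;
                }
                else i++;
            }
            stream.Position = resume;

            // Move the tail to the front so a match spanning two chunks is found.
            int keep = Math.Min(overlap, valid);
            Array.Copy(buffer, valid - keep, buffer, 0, keep);
            searchStart = i - (valid - keep);   // do not rescan bytes of a finished match
            bufferStart += valid - keep;
            carried = keep;
        }
        return results;
    }
}
```

Called as `ChunkedReplaceSketch.ReplaceBytes(new FileStream(File, FileMode.Open, FileAccess.ReadWrite), Find, Replace)`, this issues one read per chunk instead of one per byte. The chunk size is worth tuning by measurement, as suggested above.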

You should try using memory-mapped files. .NET supports them starting with .NET Framework 4.0, through the System.IO.MemoryMappedFiles namespace.

A memory-mapped file contains the contents of a file in virtual memory.

Persisted files are memory-mapped files that are associated with a source file on a disk. When the last process has finished working with the file, the data is saved to the source file on the disk. These memory-mapped files are suitable for working with extremely large source files.
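As a rough illustration of the memory-mapped approach, here is a minimal sketch. It assumes a 64-bit process (a single view over a 4 GB file is unlikely to fit in a 32-bit address space), assumes Find and Replace have the same length, and uses a plain comparison loop; `MappedReplaceSketch` is a made-up name. Whether this is faster than buffered reads would need measuring:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedReplaceSketch
{
    public static int ReplaceBytes(string path, byte[] find, byte[] replace)
    {
        if (find.Length == 0 || find.Length != replace.Length)
            throw new ArgumentException("find and replace must be non-empty and of equal length");

        long length = new FileInfo(path).Length;
        if (length < find.Length) return 0;   // also avoids mapping an empty file

        int results = 0;
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(0, length))
        {
            long i = 0;
            while (i <= length - find.Length)
            {
                int j = 0;
                while (j < find.Length && view.ReadByte(i + j) == find[j]) j++;
                if (j == find.Length)
                {
                    // Writes go to the mapped pages; the OS persists them to the file.
                    view.WriteArray(i, replace, 0, replace.Length);
                    results++;
                    i += find.Length;
                }
                else i++;
            }
        }
        return results;
    }
}
```

The operating system pages the file in and out on demand, so the whole 4 GB is never held in physical memory at once.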
