Processing Huge Files In C#

Question

I have a 4 GB file that I want to perform a byte-based find and replace on. I have written a simple program to do it, but it takes far too long (90+ minutes) to do just one find and replace. A few hex editors I have tried can perform the task in under 3 minutes, and they don't load the entire target file into memory. Does anyone know a method I can use to accomplish the same thing? Here is my current code:

    public int ReplaceBytes(string File, byte[] Find, byte[] Replace)
    {
        var Stream = new FileStream(File, FileMode.Open, FileAccess.ReadWrite);
        int FindPoint = 0;
        int Results = 0;
        for (long i = 0; i < Stream.Length; i++)
        {
            if (Find[FindPoint] == Stream.ReadByte())
            {
                FindPoint++;
                if (FindPoint > Find.Length - 1)
                {
                    Results++;
                    FindPoint = 0;
                    Stream.Seek(-Find.Length, SeekOrigin.Current);
                    Stream.Write(Replace, 0, Replace.Length);
                }
            }
            else
            {
                FindPoint = 0;
            }
        }
        Stream.Close();
        return Results;
    }

By the way, Find and Replace are small compared with the 4 GB "File". I can easily see why my algorithm is slow, but I am not sure how I could do it better.

Answer 1

Part of the problem may be that you're reading the stream one byte at a time. Try reading larger chunks and doing the replace on those. I'd start with about 8 KB and then test with larger and smaller chunks to see what gives you the best performance.
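
One thing to watch with chunks is a match that straddles two of them. Here is a rough sketch of one way to handle that, not a tuned implementation. The class name, the `chunkSize` parameter and the `Matches` helper are made up for illustration, and it assumes Find and Replace have the same length (which an in-place overwrite needs anyway). It carries the last few unprocessed bytes of each chunk over to the front of the next one:

```csharp
using System;
using System.IO;

public static class ChunkedReplace
{
    // Hypothetical helper: overwrites every occurrence of `find` with `replace`
    // in a seekable stream. Assumes both arrays have the same non-zero length.
    public static int ReplaceBytes(Stream stream, byte[] find, byte[] replace,
                                   int chunkSize = 64 * 1024)
    {
        if (find.Length == 0 || find.Length != replace.Length)
            throw new ArgumentException("find and replace must have the same non-zero length");

        int overlap = find.Length - 1;              // most bytes a partial match can leave behind
        byte[] buffer = new byte[chunkSize + overlap];
        long bufferStart = 0;                       // file offset of buffer[0]
        int carried = 0;                            // bytes kept from the previous chunk
        int results = 0;

        while (true)
        {
            // Writes move the stream position, so set it before every read.
            stream.Position = bufferStart + carried;
            int read = stream.Read(buffer, carried, chunkSize);
            if (read <= 0) break;
            int valid = carried + read;

            int i = 0;
            while (i + find.Length <= valid)
            {
                if (Matches(buffer, i, find))
                {
                    stream.Position = bufferStart + i;
                    stream.Write(replace, 0, replace.Length);
                    results++;
                    i += find.Length;               // do not rescan the replaced bytes
                }
                else
                {
                    i++;
                }
            }

            // Keep the unprocessed tail so a match crossing the chunk boundary is still found.
            carried = valid - i;
            Array.Copy(buffer, i, buffer, 0, carried);
            bufferStart += i;
        }
        return results;
    }

    private static bool Matches(byte[] buffer, int offset, byte[] pattern)
    {
        for (int j = 0; j < pattern.Length; j++)
            if (buffer[offset + j] != pattern[j]) return false;
        return true;
    }
}
```

You would call it with something like `ChunkedReplace.ReplaceBytes(new FileStream(File, FileMode.Open, FileAccess.ReadWrite), Find, Replace)` and then experiment with `chunkSize`.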

Answer 2

There are lots of better algorithms for finding a substring in a string (which is basically what you are doing).

Start here:

http://en.wikipedia.org/wiki/String_searching_algorithm

The gist of them is that you can skip a lot of bytes by analyzing your substring first. Here's a simple example:

4GB File starts with: ABCDEFGHIJKLMNOP

Your substring is: NOP

You skip the length of the substring-1 and check against the last byte, so compare C to P
It doesn't match, so the substring is not the first 3 bytes
Also, C isn't in the substring at all, so you can skip 3 more bytes (len of substring)
Compare F to P, doesn't match, F isn't in substring, skip 3
Compare I to P, etc, etc

If you match, go backwards. If the character doesn't match, but is in the substring, then you have to do some more comparing at that point (read the link for details)
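
To make the skipping idea concrete, here is a small sketch of the Boyer-Moore-Horspool variant, which is one of the algorithms covered by that link (I picked it because it is short, not because it is necessarily the best fit). The class and method names are made up for illustration. It builds a table that says how far the window can jump based on the byte under the last position of the pattern:

```csharp
using System;

public static class Horspool
{
    // Returns the index of the first occurrence of pattern in data at or after
    // start, or -1 if there is none.
    public static int IndexOf(byte[] data, byte[] pattern, int start = 0)
    {
        int m = pattern.Length;
        if (m == 0 || data.Length < m) return -1;

        // Default jump is the full pattern length (the byte is not in the pattern).
        int[] shift = new int[256];
        for (int b = 0; b < 256; b++) shift[b] = m;
        // Bytes that do occur (except in the last position) allow a smaller jump.
        for (int k = 0; k < m - 1; k++) shift[pattern[k]] = m - 1 - k;

        int pos = start;
        while (pos <= data.Length - m)
        {
            // Compare backwards from the last byte of the window.
            int j = m - 1;
            while (j >= 0 && data[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;

            // Jump according to the byte aligned with the end of the pattern.
            pos += shift[data[pos + m - 1]];
        }
        return -1;
    }
}
```

With the example above, searching "ABCDEFGHIJKLMNOP" for "NOP" looks at C, F, I, L and O before it lands on the match at index 13, so most bytes are never touched.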

Answer 3

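Instead of reading the file byte by byte, read it a buffer at a time:

    byte[] buffer = new byte[bufferSize];
    long currentPos = 0;            // file offset of the start of the next chunk
    long length = Stream.Length;    // long, because a 4 GB length overflows int
    int count;
    // The second argument is an offset into buffer, not into the file, so it stays 0.
    while ((count = Stream.Read(buffer, 0, bufferSize)) > 0)
    {
        // only buffer[0] .. buffer[count - 1] hold data from this read
        ....
        currentPos += count;
    }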

Answer 4

Another, easier way of reading more than one byte at a time:

    var Stream = new BufferedStream(new FileStream(File, FileMode.Open, FileAccess.ReadWrite));

Combining this with Saeed Amiri's example of how to read into a buffer and one of the better binary find/replace algorithms should give you better results.
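
As a rough illustration of the wrapper on its own, the sketch below keeps the one-byte-at-a-time `ReadByte` loop but serves it from a large in-memory buffer. The helper name and the 1 MB default are assumptions for the example, not measured recommendations. Note that `FileStream` already buffers internally with a small default buffer, so the gain here comes mostly from the larger buffer size:

```csharp
using System.IO;

public static class BufferedScan
{
    // Hypothetical helper: counts how often `value` occurs in the stream.
    public static long CountByte(Stream source, byte value, int bufferSize = 1 << 20)
    {
        long count = 0;
        // Disposing the BufferedStream also closes the underlying stream.
        using (var buffered = new BufferedStream(source, bufferSize))
        {
            int b;
            while ((b = buffered.ReadByte()) != -1)   // -1 signals end of stream
            {
                if (b == value) count++;
            }
        }
        return count;
    }
}
```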

Answer 5

You should try using memory-mapped files. .NET supports them starting with version 4.0, in the System.IO.MemoryMappedFiles namespace.

A memory-mapped file contains the contents of a file in virtual memory.

Persisted files are memory-mapped files that are associated with a source file on a disk. When the last process has finished working with the file, the data is saved to the source file on the disk. These memory-mapped files are suitable for working with extremely large source files.
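
Here is a minimal sketch of what that could look like for the find and replace in the question. The class name is made up, it assumes Find and Replace have the same length, and it maps the whole file as one view, which for a 4 GB file realistically needs a 64-bit process. The search itself is the naive one, so treat it as a starting point rather than a finished solution:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MappedReplace
{
    // Hypothetical helper: assumes find and replace have the same non-zero length.
    public static int ReplaceBytes(string path, byte[] find, byte[] replace)
    {
        if (find.Length == 0 || find.Length != replace.Length)
            throw new ArgumentException("find and replace must have the same non-zero length");

        long length = new FileInfo(path).Length;
        if (length < find.Length) return 0;   // also avoids mapping an empty file

        int results = 0;
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(0, length))
        {
            long i = 0;
            while (i <= length - find.Length)
            {
                // Naive comparison at offset i; a skip-based search would go here.
                int j = 0;
                while (j < find.Length && view.ReadByte(i + j) == find[j]) j++;

                if (j == find.Length)
                {
                    view.WriteArray(i, replace, 0, replace.Length);
                    results++;
                    i += find.Length;
                }
                else
                {
                    i++;
                }
            }
        }   // changes are written back to the file when the view and map are disposed
        return results;
    }
}
```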

5 answers

solution1
3 2012-04-30 17:21:01

solution2
3 2012-04-30 17:24:46

solution3
2 2012-04-30 17:23:00

solution4
1 2012-04-30 17:24:21

solution5
1 2012-04-30 17:26:34
