
What is the fastest way of comparing two huge CSV files for a change?

I think this is an architecture and/or design question:

My scenario is:

  • I export a huge amount of data from the DB to a CSV file.
  • I do it regularly.
  • I want to check whether the last exported CSV data is different from the content of the previously exported data.

How can I achieve this (without needing to loop and compare line by line)?

Notes:

  • My exporter is a .NET console application.

  • My DB is MS SQL (in case you need to know).

  • My exporter runs regularly as a scheduled task, within a PowerShell script.

It sounds like you'd just want to generate a checksum of each CSV file and compare those.
Calculate MD5 checksum for a file:

// Requires: using System.IO; and using System.Security.Cryptography;
static byte[] GetFileChecksum(string filename)
{
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(filename))
    {
        // Stream the file through the hash so the whole CSV never has to sit in memory
        return md5.ComputeHash(stream);
    }
}
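
If the comparison itself happens in the PowerShell script that schedules the export, Get-FileHash gives the same result without any C#. A minimal sketch, assuming $file1 and $file2 hold the paths of the previous and current export:

# Hash both exports and compare the resulting hex strings
$oldHash = (Get-FileHash $file1 -Algorithm MD5).Hash
$newHash = (Get-FileHash $file2 -Algorithm MD5).Hash

if ($oldHash -ne $newHash)
{
    Write-Host "Files are different!"
}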

You could have the database keep track of when the data was last modified. Simply add a trigger to the table so that whenever any row is added, deleted or updated it records the current time. You then don't need to compare the large files in the first place: your export job can simply query that last-modified time, compare it to the last modified time of the file on the file system, and decide whether it needs to regenerate the export.
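
A rough sketch of that check from the PowerShell side (the dbo.ExportAudit table and its LastModified column are made up for illustration, Invoke-Sqlcmd assumes the SqlServer/SQLPS module is available, and $server, $db and $csvPath are assumed to be set already):

$row = Invoke-Sqlcmd -ServerInstance $server -Database $db `
    -Query "SELECT LastModified FROM dbo.ExportAudit"
$csv = Get-Item $csvPath

# Re-export only if the table changed after the file was last written
# (assumes the trigger stores its timestamp in UTC)
if ($row.LastModified -gt $csv.LastWriteTimeUtc)
{
    # The data changed since the last export, so regenerate the CSV
}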

(This assumes you're doing it in PowerShell, but these techniques apply to any language.)

I recommend checking file sizes first.

Do this first, it's quick!

if ((gci $file1).Length -ne (gci $file2).Length)
{
    Write-Host "Files are different!"
}
else
{
    # Same size, so compare contents...
}

Finally, you can do a full-blown compare. If you're in PowerShell, take a look at Compare-Object (alias diff). For example:

if (diff (gc $file1) (gc $file2))
{
    Write-Host "Files are different!"
}

It might be faster to do a buffered byte-to-byte comparison, as seen here: http://keestalkstech.blogspot.com/2010/11/comparing-two-files-in-powershell.html
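
For reference, here is a minimal sketch of that buffered approach in PowerShell (the Test-FilesEqual function name and the 64KB buffer size are just illustrative):

function Test-FilesEqual($path1, $path2)
{
    $bufferSize = 64KB
    $s1 = [System.IO.File]::OpenRead($path1)
    $s2 = [System.IO.File]::OpenRead($path2)
    try
    {
        # Different lengths can never match, so bail out before reading anything
        if ($s1.Length -ne $s2.Length) { return $false }

        $b1 = New-Object byte[] $bufferSize
        $b2 = New-Object byte[] $bufferSize
        while (($read1 = $s1.Read($b1, 0, $bufferSize)) -gt 0)
        {
            $read2 = $s2.Read($b2, 0, $bufferSize)
            if ($read1 -ne $read2) { return $false }
            # Compare chunk by chunk; Base64 encoding avoids a slow per-byte loop in script
            if ([Convert]::ToBase64String($b1, 0, $read1) -ne [Convert]::ToBase64String($b2, 0, $read2))
            {
                return $false
            }
        }
        return $true
    }
    finally
    {
        $s1.Dispose()
        $s2.Dispose()
    }
}

if (-not (Test-FilesEqual $file1 $file2))
{
    Write-Host "Files are different!"
}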

Alternatives:

An MD5 comparison might actually be slower than a byte-to-byte comparison: not only do you have to read both files, you also have to do the computation to get the hashes. You can at least optimize by caching the hash of the old file, which saves half the I/O.
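
One way to do that caching, sketched here with a sidecar file whose name is purely illustrative:

# Hash only the new export and compare against the hash remembered from the previous run
$hashFile = "last-export.md5"
$newHash  = (Get-FileHash $file2 -Algorithm MD5).Hash
$oldHash  = if (Test-Path $hashFile) { Get-Content $hashFile } else { $null }

if ($oldHash -ne $newHash)
{
    Write-Host "Files are different!"
    Set-Content $hashFile $newHash   # remember the new hash for the next run
}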

Because you're exporting a database table, note that most databases append new rows at the end. You'll have to make sure that's the case for you, and that rows are only ever added, never updated. If so, you can just compare the last rows in your file, e.g. the last 4 KB or however big your row size is.
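
If that append-only assumption holds, something along these lines reads just the tail of each file (the Get-FileTail helper and the 4KB default are made up for illustration):

# Read only the last few kilobytes of a file and return them as a Base64 string
function Get-FileTail($path, $tailSize = 4KB)
{
    $fs = [System.IO.File]::OpenRead($path)
    try
    {
        $toRead = [int][Math]::Min($tailSize, $fs.Length)
        $fs.Seek(-$toRead, [System.IO.SeekOrigin]::End) | Out-Null
        $buffer = New-Object byte[] $toRead
        $fs.Read($buffer, 0, $toRead) | Out-Null
        return [Convert]::ToBase64String($buffer, 0, $toRead)
    }
    finally
    {
        $fs.Dispose()
    }
}

# Compare just the tails of the previous and current export
if ((Get-FileTail $file1) -ne (Get-FileTail $file2))
{
    Write-Host "Files are different!"
}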
