
Avoid duplication

I want to avoid duplicate files on any drive by comparing their content. The files may be in any format. I don't know which algorithm is used for fast searching, or how to compare files across all data formats. Please help me, sir.

One easy way to check for identical content is to calculate some sort of checksum for both files and compare them.

You could calculate an MD5 hash of each file's content and then compare the results.
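A minimal sketch of that checksum comparison, in Python. The function names `md5_of_file` and `same_checksum` and the 1 MB read block are assumptions made for illustration; they are not from the answer above. The file is read in blocks so that large files do not have to fit in memory.

```python
import hashlib


def md5_of_file(path, block_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it block by block."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (EOF).
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def same_checksum(path_a, path_b):
    """True if both files have the same MD5 digest (hypothetical helper)."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Equal digests make identical content very likely but do not prove it, so a byte-by-byte comparison is still needed if you want to be certain.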

Create a Dictionary<> that maps each file length in bytes to the list of filenames with that length. Fill this dictionary by traversing the whole drive.

Dictionary<long,List<string>>

Now you have the files grouped by their length. Since only files of the same length can be identical, most of the work is already done. :-)
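The grouping step could be sketched like this in Python, where a `dict` of lists plays the role of the `Dictionary<long,List<string>>`. The names `group_files_by_size` and `candidate_groups` are hypothetical, and skipping symbolic links and unreadable files is an assumption of this sketch, not something the answer specifies.

```python
import os
from collections import defaultdict


def group_files_by_size(root):
    """Map each file size in bytes to the list of paths with that size."""
    sizes = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # assumption: do not follow symbolic links
            try:
                sizes[os.path.getsize(path)].append(path)
            except OSError:
                continue  # file vanished or is not accessible
    return sizes


def candidate_groups(sizes):
    """Keep only the sizes shared by two or more files."""
    return {size: paths for size, paths in sizes.items() if len(paths) > 1}
```

Only the groups returned by `candidate_groups` need their contents compared; a file with a unique size cannot have a duplicate.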

Now you need to compare the contents of the files. Compare small files (e.g. up to 10 MB) by brute force: load all files of the same size into memory, sort them by content and search for duplicates. (The complexity is O(n log n) comparisons, where n is the number of same-sized files.)

Compare long files differently: load only the starting portion of each file, e.g. 10 MB, sort these portions and find the duplicates, and then continue only with the files whose portions match. Continue recursively, i.e. call the comparison function on each sublist of files with the same beginning: load the next 10 MB, sort, compare, call recursively, and so on.

This way you get very good speed, each file is loaded only once, and you don't need much memory. A 10 MB block is chosen because it is a good compromise between fast sequential reads and the seek times of hard drives.
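The block-by-block comparison described above might look like the following Python sketch. It departs from the description in two ways, both assumptions of this example: files are grouped by block content with a dictionary instead of being sorted, and an explicit work list replaces the recursion. The name `find_identical` is hypothetical. It expects a list of same-sized files, such as one group from the size dictionary.

```python
from collections import defaultdict


def find_identical(paths, block_size=10 * 1024 * 1024):
    """Split a list of files into groups whose contents are identical."""
    # All files stay open; a very large group could hit the OS handle limit.
    handles = {p: open(p, "rb") for p in paths}
    try:
        pending = [list(paths)]  # groups that have matched so far
        result = []
        while pending:
            group = pending.pop()
            buckets = defaultdict(list)
            for p in group:
                # Each handle keeps its own position, so this reads the next block.
                buckets[handles[p].read(block_size)].append(p)
            for block, members in buckets.items():
                if len(members) < 2:
                    continue  # this file differs from every other one
                if block == b"":
                    result.append(members)  # all reached end of file together
                else:
                    pending.append(members)  # still equal, compare next block
        return result
    finally:
        for h in handles.values():
            h.close()
```

Each file is read at most once, and a file drops out as soon as its block matches no other file in its group. Peak memory is roughly one block per distinct content in the group currently being compared, so a smaller `block_size` may be preferable when a group contains many files.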

This algorithm should be faster than computing the MD5 of every file, as proposed by Max Kielland. You also don't need to load anything twice: if you go for hashes, you need to load the files to compute the hashes, and then load them again whenever the hashes match, to compare the contents and be 100% sure that the files are the same. If you have 100 files with the same hash, you face the hard problem of how to compare them all. So I would prefer what I wrote above, i.e. no hashes.

Sometimes duplicate content has a reason. Don't go around deleting files without checking if everything still works.

