
How can I check in Python whether a string and a file have the same content?

I am very new to Python and have a question: how can I check in Python whether a string and a file have the same content? I need to download some files and rename them, but I don't want to save the same content under two or more different names (the same content can be hosted at different IP addresses).

If the file is large, I would consider reading it in chunks like this:

compare.py:

import hashlib

teststr = "foo"
filename = "file.txt"

def md5_for_file(f, block_size=2**20):
    """Hash a file object opened in binary mode, reading it in chunks."""
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()


# Hash the string (with a trailing newline, to match the file's content).
md5 = hashlib.md5()
md5.update((teststr + "\n").encode('utf8'))
digest = md5.digest()

with open(filename, 'rb') as f:
    print(md5_for_file(f) == digest)

file.txt:

foo

This program prints True if the string (plus a trailing newline) and the file match.

Use the SHA-1 hash of the file content.

#!/usr/bin/env python
from __future__ import with_statement
from __future__ import print_function

from hashlib import sha1

def shafile(filename):
    with open(filename, "rb") as f:
        return sha1(f.read()).hexdigest()

if __name__ == '__main__':
    import sys
    import glob
    globber = (filename for arg in sys.argv[1:] for filename in glob.glob(arg))
    for filename in globber:
        print(filename, shafile(filename))

This program takes wildcards on the command line, but it is just for demonstration purposes.
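The same shafile helper can also answer the original question directly: hash the candidate string and compare it to the file's digest. A minimal sketch (the file name saved.txt and its contents are just illustrative):

```python
from hashlib import sha1

def shafile(filename):
    with open(filename, "rb") as f:
        return sha1(f.read()).hexdigest()

# Create an example file to compare against (illustrative only).
with open("saved.txt", "wb") as f:
    f.write(b"some downloaded data")

# Compare freshly downloaded bytes against the saved file.
downloaded = b"some downloaded data"
print(shafile("saved.txt") == sha1(downloaded).hexdigest())  # prints True
```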

It is not necessary to use a cryptographic hash if all you want is a checksum. Python provides a CRC-32 checksum in the binascii module.

binascii.crc32(data[, crc])
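
The optional second argument lets you feed the file in chunks, passing the running CRC back in each time. A small sketch of that usage (the file name demo.txt is just an example):

```python
import binascii

def crc32_of_file(path, chunk_size=65536):
    """Compute a running CRC-32 over a file, reading it in chunks."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = binascii.crc32(chunk, crc)  # fold each chunk into the running CRC
    return crc

# Create an example file and compare its checksum to a string's checksum.
data = b"foo\n"
with open("demo.txt", "wb") as f:
    f.write(data)

print(crc32_of_file("demo.txt") == binascii.crc32(data))  # prints True
```

Note that CRC-32 is fine for detecting accidental duplicates, but unlike MD5 or SHA-1 it is trivial to find two different inputs with the same CRC, so it should not be used where collisions matter.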

While hashes and checksums are great for comparing a list of files, if you are only comparing two specific files and don't have a pre-computed hash/checksum, it is faster to compare the two files directly than to compute a hash/checksum for each and compare those.

def equalsFile(firstFile, secondFile, blocksize=65536):
    """Compare two binary file objects chunk by chunk."""
    while True:
        buf1 = firstFile.read(blocksize)
        buf2 = secondFile.read(blocksize)
        if buf1 != buf2:
            return False
        if not buf1:  # both files exhausted at the same point
            return True

In my tests, 64 md5 checks on two 50MB files complete in 24.468 seconds, while 64 direct comparisons complete in just 4.770 seconds. This method also has the advantage of instantly returning false upon finding any difference, while calculating the hash must continue to read the entire file.

An additional way to fail early on files that aren't identical is to compare their sizes with os.path.getsize(filename) before running the test above. A size difference is very common between two files with different content, and the check is nearly free, so it should be the first thing you do.

import os

if os.path.getsize('file1.txt') != os.path.getsize('file2.txt'):
    print(False)
else:
    with open('file1.txt', 'rb') as f1, open('file2.txt', 'rb') as f2:
        print(equalsFile(f1, f2))

The best way is to compute a hash (e.g. MD5) of each file and compare the hashes.

The answers above show how to compute the MD5 of a file.

For each file you download make a hash or a checksum. Keep a list of these hashes/checksums.

Then, before saving the downloaded data to disk, check whether its hash/checksum already exists in the list: if it does, don't save the data; if it doesn't, save the file and add the checksum/hash to the list.

Pseudocode:

checksums = []
for url in all_urls:
    data = download_file(url)
    checksum = make_checksum(data)
    if checksum not in checksums:
         save_to_file(data)
         checksums.append(checksum)
