
How can I tell if two image files are the same in Perl?

I have a Perl script I wrote for my own personal use that fetches image files from a website periodically. It then saves these images to a folder. These image files are quite often the same from fetch to fetch, and I'd like to not save duplicates if I can get around it.

My question: What would be the best way to compare/check if they are the same?

My only real thought so far is to open a file handle to the existing file, md5 it, md5 the $response->content from the fetch, and then compare the two digests. Would that work?

Is there a better way?

EDIT:

Wow, already tons of great suggestions. Does it help if I tell you that this script runs daily via cron, i.e. it is guaranteed to run at the exact same time every day? Also: I'm looking at the Last-Modified headers on some of these, and they don't look 100% accurate, i.e. some have a Last-Modified of over a week ago when I know the image is more recent than that. I'm assuming that's because the image file itself hasn't been modified on the server since then... which doesn't help me much...

  • Don't open and hash the stored image each time - stash the hash alongside the image when you store it. Compare sizes as well.

  • Don't issue a GET request straight away; do a HEAD first and compare the size, last modification date and any ETags with what you got last time.
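The first bullet could be sketched roughly like this. It is only a minimal sketch, assuming the digest is kept in a sidecar file named "<image>.md5" next to each image; the helper names save_with_digest and is_duplicate are made up for illustration, and Digest::MD5 is used simply because the question mentions md5.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: write the image and stash its MD5 in "<path>.md5".
sub save_with_digest {
    my ($path, $content) = @_;

    open my $img, '>:raw', $path or die "Cannot write $path: $!";
    print {$img} $content;
    close $img or die "Cannot close $path: $!";

    open my $sum, '>', "$path.md5" or die "Cannot write $path.md5: $!";
    print {$sum} md5_hex($content), "\n";
    close $sum or die "Cannot close $path.md5: $!";
    return;
}

# Hypothetical helper: true if $content matches the image already stored at $path.
sub is_duplicate {
    my ($path, $content) = @_;
    return 0 unless -e $path && -e "$path.md5";

    # Cheap check first: files of different sizes cannot be identical.
    my $size = (stat $path)[7];
    return 0 unless $size == length $content;

    # Read the stashed digest instead of re-hashing the stored image.
    open my $sum, '<', "$path.md5" or die "Cannot read $path.md5: $!";
    my $stored = <$sum>;
    close $sum;
    return 0 unless defined $stored;
    chomp $stored;

    return $stored eq md5_hex($content) ? 1 : 0;
}
```

In the fetch script this would be used along the lines of: save_with_digest($file, $response->content) unless is_duplicate($file, $response->content). The content is expected to be raw bytes, which is what an HTTP response body normally is.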

There are a number of HTTP headers you can use for this. If you save the time that you last retrieved the file, you can do a conditional GET with:

If-Modified-Since: <date>

Or, if the server returns an ETag header with the response, you can store that with the image (or a collection of all of the ETags you have seen for that image), and do:

If-None-Match: <all of your etags here>

If the server supports conditional GETs and the image has not changed, you will get a "304 Not Modified" response with no body.
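A conditional GET along these lines might look like the sketch below. It assumes HTTP::Tiny (which ships with Perl) rather than whatever client the original script uses; the same two headers can be set with LWP as well. The helper names conditional_headers and fetch_if_changed are hypothetical, and the sketch sends back the Last-Modified string from the previous response rather than the retrieval time, which is a choice made here, not something the answer prescribes.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: build conditional-request headers from the
# validators saved on the previous run (either may be undef).
sub conditional_headers {
    my ($etag, $last_modified) = @_;
    my %headers;
    $headers{'If-None-Match'}     = $etag          if defined $etag;
    $headers{'If-Modified-Since'} = $last_modified if defined $last_modified;
    return \%headers;
}

# Hypothetical helper: returns nothing when the server answers
# 304 Not Modified, otherwise a hashref with the new body and the
# validators to store for next time.
sub fetch_if_changed {
    my ($url, $etag, $last_modified) = @_;

    my $res = HTTP::Tiny->new->get(
        $url, { headers => conditional_headers($etag, $last_modified) },
    );

    return if $res->{status} == 304;    # unchanged, no body was sent
    die "Fetch of $url failed: $res->{status} $res->{reason}\n"
        unless $res->{success};

    # HTTP::Tiny lower-cases response header names.
    return {
        content       => $res->{content},
        etag          => $res->{headers}{etag},
        last_modified => $res->{headers}{'last-modified'},
    };
}
```

If the server ignores the conditional headers it will simply answer 200 with the full body, so comparing a digest of the body is still a sensible fallback.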

md5 would work, but you'd still have to pull the file. Is there any useful metadata in the HTTP headers: Content-Length, Cache-Control directives, ETags, etc.?
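One way to check that metadata without pulling the file is a HEAD request. The following is a rough sketch, assuming HTTP::Tiny and assuming the headers from the previous run were saved somewhere; looks_unchanged and head_headers are hypothetical names, and the list of headers compared is just an illustration.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Headers worth comparing; HTTP::Tiny lower-cases header names.
my @VALIDATORS = qw(etag last-modified content-length);

# Hypothetical helper: true only if at least one validator is present in
# both header sets and every validator present in both is identical.
sub looks_unchanged {
    my ($old, $new) = @_;
    my $compared = 0;
    for my $name (@VALIDATORS) {
        next unless defined $old->{$name} && defined $new->{$name};
        return 0 if $old->{$name} ne $new->{$name};
        $compared++;
    }
    return $compared > 0 ? 1 : 0;
}

# Hypothetical helper: issue a HEAD request and return its headers (or die).
sub head_headers {
    my ($url) = @_;
    my $res = HTTP::Tiny->new->head($url);
    die "HEAD $url failed: $res->{status} $res->{reason}\n"
        unless $res->{success};
    return $res->{headers};
}
```

Since the question's edit notes that Last-Modified is not always trustworthy on this site, a match here is best treated as a hint; when the headers differ or are missing, downloading and comparing digests remains the reliable check.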

Yep, that sounds right. Depending on how you're getting the file and how frequently, you might also be able to check for HTTP 304 Not Modified and save yourself the download.

There's also a nice tool, fdupes, for this purpose. I don't know what system you're using or which systems the tool can be built for.


 