
How to sufficiently hash an image to avoid collisions?

I want to use hashes to uniquely identify photos from an Android phone, to answer queries like "does the server have xyz?" and "fetch the image which hashes to xyz". I face the following problems:

  1. Hashing the whole image is likely to be slow, hence I want to hash only the first few units (bytes) of the image file, not the whole file.
  2. The first few bytes are insufficient due to composition, e.g. a user takes a photo of a scene, and then takes a second photo of the same scene after adding a paper clip at the bottom of the frame.
  3. The first few bytes are insufficient to avoid hash collisions, i.e. they may cause mix-ups between users.

How many bytes must I hash from the image file so that I keep the chance of a mishap low? Is there a better indexing scheme?

As soon as you leave any bytes out of the hash, you give someone the opportunity to create (either deliberately or accidentally) a file that differs only at those bytes, and hence hashes the same.

How different this image actually looks from the original depends to some extent on how many bytes you leave out of the hash, and where. But you first have to decide what hash collisions you can tolerate (deliberate/accidental and major/minor); then you can think about how fast a hash function you can use, and how much data you need to include in it.

Unless you're willing to tolerate a "largeish block" of data changing, you need to include bytes from every "largeish block" in the hash. From the point of view of I/O performance this means you need to access pretty much the whole file, since reading even one byte will cause the hardware to read the whole block that contains it.

Probably the thing to do is start with "definitely good enough", such as a SHA-256 hash of the whole file. See how much too slow that is, then think about how to improve performance by the required percentage. For example, if it's only 50% too slow, you could probably solve the problem with a faster (less secure) hash while still including all the data.
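A minimal sketch of that baseline on the JVM/Android, using the standard java.security.MessageDigest API and streaming the file so memory use stays constant; the class name, buffer size and hex printing are just illustrative:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;

    public class FullFileHash {

        // Stream the whole file through SHA-256 so memory use stays constant
        // regardless of how large the photo is.
        static byte[] sha256(Path file) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            try (InputStream in = Files.newInputStream(file)) {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) != -1) {
                    md.update(buf, 0, n);
                }
            }
            return md.digest();
        }

        public static void main(String[] args) throws Exception {
            byte[] digest = sha256(Path.of(args[0]));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b & 0xff));
            System.out.println(hex);   // 64 hex characters identifying the file
        }
    }

Time this on a representative set of photos first; the answer to "how much too slow is it" decides whether you need anything cleverer at all.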

You can work out the limit of how fast you can go with a less secure hash by implementing some completely trivial hash (eg XOR of all the 4-byte words in the file), and see how fast that runs. If that's still too slow then you need to give up on accuracy and only hash part of the file (assuming you've already done your best to optimize the I/O).
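For example, a trivial XOR-of-4-byte-words pass might look like the sketch below. It is deliberately not a usable hash; it only measures how fast you can go when the per-byte work is essentially free, i.e. when you are purely I/O bound, so leftover bytes at buffer boundaries are simply ignored:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class XorBenchmark {

        // XOR all complete 4-byte words in the file together. NOT a real hash;
        // it only shows the I/O-bound lower limit for whole-file hashing speed.
        static int xorWords(Path file) throws Exception {
            int acc = 0;
            try (InputStream in = Files.newInputStream(file)) {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) != -1) {
                    // Bytes left over at a buffer boundary are dropped here;
                    // that's fine for a throughput measurement.
                    for (int i = 0; i + 4 <= n; i += 4) {
                        acc ^= ((buf[i] & 0xFF) << 24)
                             | ((buf[i + 1] & 0xFF) << 16)
                             | ((buf[i + 2] & 0xFF) << 8)
                             |  (buf[i + 3] & 0xFF);
                    }
                }
            }
            return acc;
        }
    }

If this runs barely faster than SHA-256 on your device, the cost is in the I/O, and switching hash functions won't help.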

If you're willing to tolerate collisions, then for most (all?) image formats, there's enough information in the header alone to uniquely identify "normal" photos. This won't protect you against deliberate collisions or against the results of image processing, but barring malice the timestamp, image size, camera model etc, together with even a small amount of image data will in practice uniquely identify every instance of "someone taking a photo of something". So on that basis, you could hash just the first 64-128k of the file (or less, I'm being generous to include the max size of an EXIF header plus some) and have a hash that works for most practical purposes but can be beaten if someone wants to.
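A rough sketch of that prefix-only variant, assuming a 128 KB cap (the exact cutoff is an assumption you would tune for your image format), again using SHA-256 so the only thing you give up is coverage of the rest of the file:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;

    public class PrefixHash {

        // Hash only the first prefixBytes of the file (e.g. 128 * 1024), which
        // normally covers the EXIF header plus some image data. Good against
        // accidents, useless against someone deliberately crafting collisions.
        static byte[] sha256Prefix(Path file, int prefixBytes) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            try (InputStream in = Files.newInputStream(file)) {
                byte[] buf = new byte[16 * 1024];
                int remaining = prefixBytes;
                int n;
                while (remaining > 0
                        && (n = in.read(buf, 0, Math.min(buf.length, remaining))) != -1) {
                    md.update(buf, 0, n);
                    remaining -= n;
                }
            }
            return md.digest();
        }
    }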

Btw, unless done deliberately by a seriously competent photographer (or unless the image is post-processed deliberately to achieve this), taking two photos of the same scene with a small difference in the bottom right corner will not result in identical bytes at the beginning of the image data. Not even close, if you're in an environment where you can't control the light. Try it and see. It certainly won't result in an identical file when done with a typical camera that timestamps the image. So the problem is much easier if you're only trying to defend against accidents, than it is if you're trying to defend against deception.

I think the most effective approach is to pick random byte positions (chosen in advance, and kept fixed throughout) and compute an XOR or some other simple hash over them; that should be good enough.
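One way to read that suggestion is sketched below; the number of sampled positions, the fixed seed, and the simple multiply-and-add mixing step are illustrative choices rather than anything prescribed by the answer:

    import java.io.RandomAccessFile;
    import java.nio.file.Path;
    import java.util.Random;

    public class SampledHash {

        // Byte positions are chosen once from a fixed seed and reused for every
        // file, so the same image is always sampled at the same offsets.
        static final long[] OFFSETS = fixedOffsets(256, 8L * 1024 * 1024, 42L);

        static long[] fixedOffsets(int count, long maxOffset, long seed) {
            Random rnd = new Random(seed);
            long[] offsets = new long[count];
            for (int i = 0; i < count; i++) {
                offsets[i] = (long) (rnd.nextDouble() * maxOffset);
            }
            return offsets;
        }

        // Mix the file length with the sampled bytes; offsets past the end of
        // a smaller file are simply skipped.
        static long sampleHash(Path file) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
                long length = raf.length();
                long h = length;
                for (long off : OFFSETS) {
                    if (off >= length) continue;
                    raf.seek(off);
                    h = 31 * h + raf.read();   // read() returns 0..255 here
                }
                return h;
            }
        }
    }

Note that, as the accepted answer points out, anything that skips bytes like this only defends against accidental duplicates, not against someone deliberately constructing a colliding file.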
