
Locality Sensitive Hash or pHash?

I'm trying to implement a general fingerprint memoizer: a file can be represented by a perceptual fingerprint (like pHash for images or Chromaprint for audio), and if our desired (expensive) function has already been computed on a similar file, we return the cached result and avoid the expensive computation.

Locality Sensitive Hashing (LSH) is a popular and well-performing solution to the approximate nearest neighbor problem in high-dimensional spaces.

pHash is a good library which implements perceptual hashing for images.

So pHash transforms a multi-dimensional input (an image) into a one-dimensional object (a hash code), which is different from LSH, where the objects stay multi-dimensional.

So I'm wondering: how could we implement a one-dimensional LSH for pHash values? In other words, how can we group similar pHash values into bins? Could that be an alternative to the classic LSH approach (and if not, why)?

You could use n random projections to split the pHash space into 2^n buckets; similar images are then most likely found in the same bucket. You could even XOR the hash with all 64 possible integers of Hamming weight 1 to check the neighboring buckets conveniently and be sure you'd find all approximate matches.

This is efficient only if you are interested in images with almost identical hashes (small Hamming distance). If you want to tolerate larger Hamming distances (such as 8), it gets tricky to find all matches efficiently and accurately. I got very good performance by scanning through the whole table on the GPU; even my 3-year-old laptop's GT 650M could check 700 million hashes per second!
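For reference, a minimal CPU sketch of such a full-table scan might look like the following (NumPy-based; the function and array names are my own, not part of pHash, and a GPU version would follow the same XOR-and-popcount idea):

```python
import numpy as np

def hamming_scan(query_hash, table, max_dist=8):
    """Brute-force scan: indices of all stored hashes within max_dist bits of query_hash.

    query_hash: a single 64-bit pHash as a Python int
    table:      np.ndarray of dtype uint64 holding every stored hash
    """
    diff = table ^ np.uint64(query_hash)          # 1 bit wherever the two hashes differ
    # popcount of each 64-bit difference = Hamming distance to the query
    dist = np.unpackbits(diff.view(np.uint8).reshape(-1, 8), axis=1).sum(axis=1)
    return np.nonzero(dist <= max_dist)[0]
```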

Edit 1: You can think of a 64-bit hash as a single corner of a 64-dimensional cube; the math is easier if you normalize the corner coordinates to -1 and 1 (that way the cube's center is at the origin). You can express m images as a matrix M of size m x 64 (one row per image, one bit of the hash per column).

The simplest way to split this into 2^n distinct groups is to generate n 64-dimensional vectors v_1, v_2, ..., v_n (pick each vector element from the normal distribution N(0,1)); this can be expressed as a matrix V of size 64 x n (one column per vector). You could also enforce orthogonality, as mentioned at Random projection, but I'll skip that here.

Now, by calculating A = (M * V) > 0, you get an m x n matrix (one image per row, one projection per column). Next, convert each row's binary representation to a number; you get 2^n different possibilities, and similar hashes are most likely to end up in the same bucket.
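A rough NumPy sketch of those two steps, assuming the hashes arrive as 64-bit integers (the helper names are mine; M, V, A and n follow the notation above):

```python
import numpy as np

N_PROJECTIONS = 16                                 # n: number of random projections
rng = np.random.default_rng(0)                     # fixed seed so V stays the same
V = rng.normal(size=(64, N_PROJECTIONS))           # 64 x n projection matrix

def hashes_to_matrix(hashes):
    """Turn 64-bit pHashes into the m x 64 matrix M with -1/+1 entries."""
    bits = np.unpackbits(np.asarray(hashes, dtype=np.uint64)
                           .view(np.uint8).reshape(-1, 8), axis=1)
    return bits.astype(np.int8) * 2 - 1            # map bit {0,1} -> coordinate {-1,+1}

def bucket_ids(hashes):
    """Bucket id in [0, 2^n) for each hash."""
    M = hashes_to_matrix(hashes)                   # m x 64
    A = (M @ V) > 0                                # m x n: sign of each projection
    # read each row of A as an n-bit number -> bucket id
    return A.astype(np.int64) @ (1 << np.arange(N_PROJECTIONS))
```

Keeping V fixed (here via a seeded generator) is what makes the bucket ids reproducible across runs, which the memoization use case needs.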

This algorithm works for any vector representation of the data (such as SURF features), not just binary strings. I'm sure there are simpler (and computationally more efficient) algorithms for binary hashes, but this is one way to implement random projections.

I suggested XORing because if two images don't have identical hashes, they aren't guaranteed to end up in the same bucket. By checking all possible small deviations from the original hash, you can see which other bins could hold likely matches.
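As a concrete sketch of that multi-probe step (plain Python; probe_buckets, lookup, bucket_of and index are hypothetical names, and bucket_of could be the random-projection bucketing sketched above):

```python
def probe_buckets(query_hash, bucket_of):
    """Bucket ids that could hold any hash within Hamming distance 1 of query_hash.

    bucket_of: any function mapping a 64-bit hash to its bucket id,
               e.g. lambda h: bucket_ids([h])[0] with the sketch above.
    """
    probes = [query_hash] + [query_hash ^ (1 << bit) for bit in range(64)]
    return {bucket_of(h) for h in probes}

def lookup(query_hash, bucket_of, index):
    """index: dict mapping bucket id -> list of (hash, memoized result) pairs."""
    candidates = []
    for b in probe_buckets(query_hash, bucket_of):
        candidates.extend(index.get(b, []))
    # candidates would still be verified with an exact Hamming distance check
    return candidates
```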

In a way, this is similar to how a computer game engine might split a 2D map into a grid of cells of size x: to find all units within a radius x of a point, you only need to check 9 cells (the one containing the point plus the 8 surrounding cells) to get a 100% accurate answer.
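Purely for illustration, that grid lookup could be sketched as (names are mine):

```python
def cells_to_check(px, py, cell_size):
    """All grid cells that can contain a unit within cell_size of the point (px, py)."""
    cx, cy = int(px // cell_size), int(py // cell_size)   # cell containing the point
    return [(cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
```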
