
Locality Sensitive Hash or pHash?

I'm trying to implement a general fingerprint memoizer: we have a file that can be expressed through an intelligent fingerprint (like pHash for images or Chromaprint for audio), and if our desired (expensive) function has already been computed on a similar file, then we return the same result (avoiding the expensive computation).
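A minimal sketch of the memoization pattern I have in mind (Python), assuming a hypothetical `fingerprint()` wrapper around pHash/Chromaprint and a hypothetical `find_similar()` approximate-lookup helper; neither name comes from an actual library:

```python
cache = {}   # fingerprint -> previously computed result
index = []   # fingerprints seen so far

def memoized(path, expensive_fn):
    fp = fingerprint(path)              # hypothetical perceptual-hash wrapper
    match = find_similar(fp, index)     # hypothetical approximate lookup
    if match is not None:
        return cache[match]             # a similar file was already processed
    result = expensive_fn(path)         # pay the full cost only once
    index.append(fp)
    cache[fp] = result
    return result
```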

Locality Sensitive Hashing (LSH) is a popular and well-performing solution for the approximate nearest neighbor problem in an expensive multi-dimensional space.

pHash is a good library that implements perceptual hashing for images.

So pHash transforms a multi-dimensional input (an image) into a one-dimensional object (a hash code), which is different from LSH, which operates on multi-dimensional objects.

So I'm wondering: how could we implement a one-dimensional LSH for pHash hash values? Or, in a few words: how can we group similar pHash values into bins? Could this be an alternative to the classic LSH approach (and if not, why)?

You could use n random projections to split the pHash space into 2^n buckets; similar images are then most likely found in the same bucket. You could even XOR the hash with all 64 possible integers of Hamming weight 1 to check the neighboring buckets conveniently and be sure you find all approximate matches.
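As a rough sketch of that XOR trick (Python), assuming the hash is a plain 64-bit integer and `bucket_of()` stands for whatever bucketing function you use (for example the random projections described below):

```python
def candidate_buckets(h, bucket_of):
    buckets = {bucket_of(h)}                     # bucket of the hash itself
    for bit in range(64):                        # flip each of the 64 bits once
        buckets.add(bucket_of(h ^ (1 << bit)))   # bucket of the perturbed hash
    return buckets                               # bins worth checking for matches
```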

This is efficient only if you are interested in images with almost identical hashes (small Hamming distance). If you want to tolerate larger Hamming distances (such as 8), then it gets tricky to find all matches both efficiently and accurately. I got very good performance by scanning through the whole table on the GPU; even my 3-year-old laptop's GT 650M could check 700 million hashes per second!
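For comparison, the whole-table scan is just a popcount loop; here is a CPU sketch assuming 64-bit integer hashes (the GPU version runs the same comparison massively in parallel):

```python
def linear_scan(query, hashes, max_dist=8):
    # Hamming distance = population count of the XOR of the two hashes
    return [h for h in hashes if bin(query ^ h).count("1") <= max_dist]
```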

Edit 1: You can think of a 64-bit hash as a single corner of a 64-dimensional cube; the math is easier if you normalize the corner coordinates to -1 and +1 (this way the cube's center is at the origin). You can express m images as a matrix M of size m x 64 (one row per image, one bit of the hash per column).
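A small NumPy sketch of building that matrix M from a list of 64-bit integer hashes (mapping bit 0 to -1 and bit 1 to +1 is my own choice of orientation):

```python
import numpy as np

def hashes_to_matrix(hashes):
    """Turn m 64-bit integer hashes into an m x 64 matrix with entries in {-1, +1}."""
    bits = np.array([[(h >> b) & 1 for b in range(64)] for h in hashes])
    return 2.0 * bits - 1.0   # map {0, 1} -> {-1, +1}
```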

The simplest way to split this into 2^n distinct groups is to generate n 64-dimensional vectors v_1, ..., v_n (pick each vector element from the normal distribution N(0,1)); this can be expressed as a matrix V of size 64 x n (one column per vector). There could be an orthogonality enforcement step, as mentioned at Random projection, but I'll skip it here.
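The projection matrix V can then be drawn in one line; the seed and n = 10 are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed, for reproducibility
n = 10                                # n projections -> 2^n buckets
V = rng.standard_normal((64, n))      # 64 x n matrix, entries drawn from N(0, 1)
```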

Now by calculating A = (M * V) > 0 you get an m x n matrix (one image per row, one projection per column). Next, convert each row's binary representation to a number; you get 2^n different possibilities, and similar hashes are most likely to end up in the same bucket.
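Putting the previous two sketches together, the bucket of every image falls out of one matrix product (M and V as defined above):

```python
import numpy as np

def bucket_indices(M, V):
    A = (M @ V) > 0                        # m x n boolean matrix, one bit per projection
    weights = 1 << np.arange(V.shape[1])   # 1, 2, 4, ..., 2^(n-1)
    return A @ weights                     # pack each row's bits into a bucket id
```

Hashes that differ in only a few bits usually produce identical rows of A and therefore the same bucket id; the XOR trick above covers the cases where a flipped bit pushes an image across one of the projection hyperplanes.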

This algorithm works for any orthogonal representation of the data (such as SURF features), not just binary strings. I'm sure there are simpler (and computationally more efficient) algorithms for binary hashes, but this is one way to implement random projections.

I suggested XORing because if images don't have identical hashes, then they aren't guaranteed to end up in the same bucket. By checking all possible small deviations from the original hash, you can see which other bins may contain likely matches.

In a way this is similar to how a computer game engine might split a 2D map into a grid of cells of size x: to find all units within a radius x of a point, you only need to check 9 cells (the one containing the point plus the 8 surrounding cells) to get a 100% accurate answer.


 