
Hashing an image (series of RGB bytes)

I'm developing an application that involves screen capture and hashing in C/C++. The image I'm capturing is about 250x250 pixels, and I'm using the WinAPI HashData function for hashing.

My goal is to compare 2 hashes (i.e. 2 images of 250x250) and instantly tell if they're equal.

My code:

    // Pack the captured COLORREF pixels into a contiguous RGB byte buffer.
    const int PIXEL_SIZE = (sc_part.height * sc_part.width) * 3;
    BYTE* pixels = new BYTE[PIXEL_SIZE];
    for (UINT y = 0, b = 0; y < sc_part.height; y++) {
        for (UINT x = 0; x < sc_part.width; x++) {
            COLORREF rgb = sc_part.pixels[(y * sc_part.width) + x];
            pixels[b++] = GetRValue(rgb);
            pixels[b++] = GetGValue(rgb);
            pixels[b++] = GetBValue(rgb);
        }
    }

    // Hash the buffer into MAX_HASH_LEN bytes with HashData.
    const int MAX_HASH_LEN = 64;
    BYTE Hash[MAX_HASH_LEN] = {0};
    HashData(pixels, PIXEL_SIZE, Hash, MAX_HASH_LEN);

    // ... I now have my variable-size hash; the example above uses 64 bytes

    delete[] pixels;

I've tested different hash sizes and the approximate time each takes to compute:

           32 bytes  = ~30ms
           64 bytes  = ~47ms
           128 bytes = ~65ms
           256 bytes = ~125ms

My question is:

How long should the hash code be for a 250x250 image to prevent any duplicates, like never?

I don't like a hash code of 256 characters, since it will cause my app to run slowly (the captures are very frequent). Is there a "safe" hash size per image dimensions for comparing?

Thanks

Assuming, based on your comments, that you're adding each hash calculated "on the fly" to the database, so that the hash of every image in the database ends up being compared to the hash of every other image, you've run into the birthday paradox. The likelihood that a set of randomly selected numbers (e.g. the birthdays of a group of people) contains two identical numbers is greater than you'd intuitively assume. If there are 23 people in a room, there's roughly a 50:50 chance that two of them share a birthday.
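The 23-person figure can be checked directly. The following is a minimal sketch, assuming 365 equally likely birthdays; the function name `shared_birthday_probability` is made up for this illustration.

```cpp
#include <cassert>

// Probability that at least two of `people` share a birthday, assuming
// `days` equally likely birthdays (an idealization of real birthdays).
double shared_birthday_probability(int people, int days = 365) {
    assert(people >= 0 && days > 0);
    if (people > days) return 1.0;  // more people than days: a repeat is forced
    double all_distinct = 1.0;
    for (int i = 0; i < people; ++i) {
        // The (i+1)-th person must avoid the i birthdays already taken.
        all_distinct *= static_cast<double>(days - i) / days;
    }
    return 1.0 - all_distinct;
}
```

Calling `shared_birthday_probability(23)` gives a value just above one half, while 22 people stay just below it.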

That means that, assuming a good hash function, you can expect a collision (two images having the same hash despite not being identical) after roughly 2^(N/2) hashes, where N is the number of bits in the hash. If your hash function isn't so good, you can expect a collision even earlier. Unfortunately, only Microsoft knows how good HashData actually is.
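To put numbers on the 2^(N/2) rule, here is a rough sketch using the standard birthday-bound approximation p ≈ 1 - exp(-n(n-1)/2^(N+1)). It assumes an ideal hash whose outputs are uniformly random, which may not hold for HashData; the function name `collision_probability` is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Approximate probability that at least two of `count` hashes collide,
// assuming each hash is a uniformly random `bits`-bit value.
double collision_probability(unsigned long long count, int bits) {
    assert(bits > 0);
    if (count < 2) return 0.0;  // a collision needs at least two hashes
    const double n = static_cast<double>(count);
    const double pairs = n * (n - 1.0) / 2.0;              // pairs that could collide
    const double expected = std::ldexp(pairs, -bits);      // pairs / 2^bits
    return -std::expm1(-expected);  // 1 - exp(-expected), accurate for tiny values
}
```

With a 64-bit hash, `collision_probability(1ULL << 32, 64)` is already close to 0.39, which is why 2^(N/2) is the usual yardstick. With a 256-bit (32-byte) hash, even a billion stored images leave the probability far below anything of practical concern.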

Your comments also bring up a couple of other issues. One is that HashData doesn't produce variable-sized hashes. It produces an array of bytes that's always the same length as the value you passed as the hash length. Your problem is that you're treating it as a string of characters instead. In C++, C-style strings are zero-terminated, meaning that the end of the string is marked with a zero-valued character ('\0'). Since the array of bytes will contain zero-valued elements at random positions, it will appear to be truncated when used as a string. Treating the hash as a string like this makes it much more likely that you'll get a collision.
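The difference is easy to see by comparing the same buffers both ways. This is an illustrative sketch with made-up helper names, not the asker's code: `std::memcmp` looks at every byte, while `std::strcmp` stops at the first zero byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Compare two fixed-length hashes as raw bytes. Zero bytes are ordinary
// data here, so all `len` bytes are always compared.
bool hashes_equal(const unsigned char* a, const unsigned char* b, std::size_t len) {
    assert(a != nullptr && b != nullptr);
    return std::memcmp(a, b, len) == 0;
}

// The mistake: treating the hashes as C strings. Comparison stops at the
// first zero byte, so everything after it is ignored. (Each buffer must
// contain a zero byte somewhere, or this reads past its end.)
bool hashes_equal_as_strings(const unsigned char* a, const unsigned char* b) {
    return std::strcmp(reinterpret_cast<const char*>(a),
                       reinterpret_cast<const char*>(b)) == 0;
}
```

For the buffers `{0x12, 0x00, 0xAB, 0xCD}` and `{0x12, 0x00, 0xEF, 0x01}`, the byte comparison reports a difference, but the string comparison reports a match because only the first byte ahead of the zero is effectively compared.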

The other issue is that you said you store the images being compared in your database and that these images must be unique. If this uniqueness is being enforced by the database, then checking for uniqueness in your own code is redundant. Your database might very well be able to do this faster than your own code.

GUIDs (Globally Unique IDs) are 16 bytes long, and Microsoft assumes that no GUIDs will ever collide.

Using a 32-byte hash is equivalent to taking two randomly generated GUIDs and comparing them against two other randomly generated GUIDs.

The odds that two given images collide with a 32-byte hash are vanishingly small: 1 in 2^256, or about 1 in 1.15792089E77.

The universe will reach heat death long before you get a collision.

This comment from Michael Grier more or less encapsulates my beliefs. In the worst case, you should take an image, compute a hash, change the image by 1 byte, and recompute the hash. A good hash should change by more than one byte.
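That experiment can be sketched in a few lines. The code below uses 64-bit FNV-1a purely as a stand-in, since HashData is Windows-only and its algorithm isn't specified here; to test HashData itself, substitute it for `fnv1a64`. The helper name `bits_changed_by_one_byte` is invented for this example.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 64-bit FNV-1a, used only as a stand-in hash for this experiment.
std::uint64_t fnv1a64(const std::vector<std::uint8_t>& data) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (std::uint8_t byte : data) {
        h ^= byte;
        h *= 1099511628211ULL;  // FNV prime
    }
    return h;
}

// Number of hash bits that differ after changing one byte of the input.
// `data` is taken by value so the caller's buffer is left untouched.
std::size_t bits_changed_by_one_byte(std::vector<std::uint8_t> data, std::size_t index) {
    assert(index < data.size());
    const std::uint64_t before = fnv1a64(data);
    data[index] ^= 0x01;  // flip the lowest bit of one byte
    const std::uint64_t after = fnv1a64(data);
    return std::bitset<64>(before ^ after).count();
}
```

Running this over a 250x250x3 byte buffer at several positions shows how widely a one-byte change spreads through the hash; a hash that only ever changes a handful of bits is a poor candidate.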

You also need to trade this off against the "birthday effect" (related to the pigeonhole principle): any hash will generate collisions. A quick comparison of the first N bytes, though, will typically reject collisions.

Cryptographic hashes are typically "better" hashes in the sense that more hash bits change per input bit change, but they are much slower to compute.
