
Set-based hash (digest) algorithm?

Is there a message digest algorithm whose digests support set operations, so that the result of the operation still makes sense? In other words, is there a hash function that does NOT break the concept of a "set" before and after hashing?

I'm looking for a hash function that:

  1. hashes a set of data into a fixed-length (or bounded-length) string
  2. produces an identical hash whenever the input data set is the same
  3. commutes with taking subsets: if you select a subset of your raw data, hashing that subset directly is equivalent to applying the same subset operation to the hash of the original data set, i.e. you get the same subset hash both ways.

As an example, in the following picture set A has several data points (red diamonds), and B is a subset of A. Is there a hash function such that:

data in A ---- hash function ----> _hashA ---- set operation ----> _hashB

data in B ---- hash function ----> _hashB

(image: enter image description here)

This looks a bit like http://en.wikipedia.org/wiki/Homomorphic_encryption and a bit like database privacy schemes like http://en.wikipedia.org/wiki/Differential_privacy - at least to me.

In both cases developers have run into problems: once users were allowed to perform a few operations, they found clever ways to combine those operations as building blocks to do anything they wanted, so the system ended up with no security at all.

In your case I think you want AndHash(hash(a), hash(b)) = hash(a and b). This means that for a single element a with hash(a) != hash(null set), I can find out whether a is a member of any set from that set's hash value alone: I compute AndHash(hash(a), hash(set)) and compare the result with hash(null set). If I can do this for many candidate elements, I can work out many of the members of a hashed set given its hash value, which means that the hash value must be pretty much as big as the set, as it contains all the information in it.
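The argument above can be made concrete with a deliberately naive digest. This is only a sketch under stated assumptions: the universe of possible elements is small and known in advance (`UNIVERSE` is hypothetical), and the "hash" is a bitmask with one bit per possible element. The names `set_hash`, `and_hash`, and `recover_members` are made up for illustration. The digest satisfies AndHash(hash(a), hash(b)) = hash(a and b) exactly, but it needs one bit per possible element and lets anyone read the members back out.

```python
# Hypothetical, small universe of possible elements (an assumption for this sketch).
UNIVERSE = ["a", "b", "c", "d", "e"]
INDEX = {item: i for i, item in enumerate(UNIVERSE)}


def set_hash(items):
    """Toy digest: one bit per possible element of the universe."""
    digest = 0
    for item in items:
        digest |= 1 << INDEX[item]
    return digest


def and_hash(h1, h2):
    """Digest-level intersection; equals set_hash of the intersection."""
    return h1 & h2


def recover_members(digest):
    """Membership probing: test every candidate element against the digest."""
    empty = set_hash([])
    return {
        item
        for item in UNIVERSE
        if and_hash(digest, set_hash([item])) != empty
    }


if __name__ == "__main__":
    h = set_hash({"b", "e"})
    print(recover_members(h))
```

The digest here is as large as the universe, not as small as a conventional hash, which is the trade-off the answer describes.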

Depending on what you want this for, it might be worth looking at http://en.wikipedia.org/wiki/Minhash .
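As a rough illustration of what MinHash offers, here is a minimal sketch, not a production implementation. It assumes elements are strings and simulates a family of hash functions by prefixing a seed before hashing with SHA-256; the helper names (`minhash_signature`, `merge_union`, `estimate_jaccard`) are made up for this example. A signature has a fixed length, is identical for identical sets, and supports one set operation directly on signatures: the signature of a union is the element-wise minimum. It does not give an exact subset or intersection operation, only a similarity estimate.

```python
import hashlib


def _seeded_hash(seed, item):
    """One member of a simulated hash family: SHA-256 of 'seed:item', truncated to 64 bits."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=64):
    """Fixed-length signature: the minimum hash value per seeded hash function."""
    items = list(items)
    if not items:
        raise ValueError("MinHash signature of an empty set is undefined in this sketch")
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def merge_union(sig_a, sig_b):
    """Signature of the union of two sets, computed from their signatures alone."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions; estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    A = {"x1", "x2", "x3"}
    B = {"x3", "x4"}
    sig_a, sig_b = minhash_signature(A), minhash_signature(B)
    print(merge_union(sig_a, sig_b) == minhash_signature(A | B))
    print(estimate_jaccard(sig_a, sig_b))
```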

AFAIK, no. Hash functions generally (and I've seen many) operate on a single chunk of data without any regard whatsoever for what that data may actually represent, the primary concern being to reduce the probability of collisions. That said, it's certainly possible to come up with something like what you want, but I imagine it would be exceedingly difficult, and the result would most likely be suboptimal in terms of collision avoidance.
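One possible sketch of "something like" this, offered as an illustration and not as an established answer to the question: hash each element separately and XOR the results. This assumes elements are strings, and the function names are hypothetical. The digest has a fixed length and does not depend on element order, and the hash of a subset B can be derived from the hash of A, but only if you know exactly which elements of A were removed. Combining by XOR is also not expected to resist deliberate collision attacks, which fits the concern about collision avoidance.

```python
import hashlib


def element_hash(item):
    """256-bit hash of a single element (assumed to be a string)."""
    return int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest(), "big")


def set_hash(items):
    """Order-independent digest: XOR of the per-element hashes."""
    digest = 0
    for item in set(items):  # set() drops duplicates so they cannot cancel out
        digest ^= element_hash(item)
    return digest


def remove_elements(digest, removed):
    """Digest of A minus `removed`, assuming every removed element is in A."""
    return digest ^ set_hash(removed)


if __name__ == "__main__":
    A = {"p1", "p2", "p3", "p4"}
    B = {"p1", "p3"}
    hash_a = set_hash(A)
    # Derive the hash of B from the hash of A and the elements dropped from A.
    print(remove_elements(hash_a, A - B) == set_hash(B))
```

Note that this is weaker than the third requirement in the question: the subset operation needs the removed elements themselves, not just a description of the subset.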

The short answer is no, there isn't such an algorithm. What you might try is encrypting your data, decrypting it when you need to apply your set function, and then encrypting it again. Hashing algorithms, however, are by their very nature one-way and involve the loss of data. There's a good explanation of the difference between hash and encryption algorithms here: "Fundamental difference between Hashing and Encryption algorithms".
