简体繁体中英

Which hash function does HyperLogLog use?

原文 2022-05-20 16:46:57 4 1 algorithm/ hash/ hyperloglog/ loglog

I have read in a few articles that HyperLogLog and LogLog use a hash function and that it is solely responsible for the prediction value. If we assign a value to a certain username to predict the number of times the individual has visited a page, and that value is constant for the name, the algorithm will contain bias .

For example, if Bob is hashed as 1000101 and its value is always going to be this, how will the algorithm predict the number of times a unique user has visited a page? All Bobs are always going to have a constant value 1 even if a new Bob visits the page. Could anyone explain how LogLog is able to predict the value of the unique visitors.

I know it uses a 64 bit hash and the above example is just for understanding.

1 answers

There are two separate processes involved here, and I think your confusion is coming from mixing the two of them up.

You can think of a HyperLogLog estimator as a black box that has two operations:

see(x) , which records that x has been seen, and
estimate() , which returns an estimate of how many distinct values have been see n.

In your example, you're discussing the possibility that multiple distinct people named Bob visit a website, and website then calls see(Bob) multiple times. In that case, you're right - HyperLogLog has no idea that these are different people named Bob. But that has nothing to do with the internal hash functions being used - that's purely a consequence of the fact that HyperLogLog only promises to estimate the number of distinct things it's seen, and if you treat each person named Bob as just "Bob," HyperLogLog has no way of knowing they aren't different people.

(As an analogy, imagine you asked me to count how many different people were in the list [Bob, Bob, Bob, Bob, Bob] . I'd assume the answer is "one," but if you then told me "actually, each Bob is a different person," my response would be "ah, I see, but the way you asked me to solve the problem wasn't fair because you didn't give me any information to differentiate the Bobs from one another.")

In practice, though, if you wanted to count distinct users to a website, you wouldn't just use a first name as a way of identifying a person. You might, for example, use a person's IP address, treating different IP addresses as different visitors. Or you might require a login of some sort, then use a person's internal ID number.

As for which hash function HyperLogLog uses, that's separate from the above discussion. The original paper on HyperLogLog doesn't say which hash function to use and instead just assumes it's given one that looks "random enough." In practice, you'd go with whichever hash function you thought looked "random enough," which could be something like the Jenkins hash, or SHA-256, etc. depending on context.

How does the HyperLogLog algorithm work?

Why does Perlin noise use a hash function rather than computing random values?

What algorithm does MongoDB use to hash ObjectIds?

What hashing algorithm does memcached use to hash keys?

Explanation on HyperLogLog algorithm

Applying HyperLogLog to a sample of the population

HyperLogLog Bucket size restriction

Does ready two-way hash function for strings with fixed hash length exist in Java?

Why does gzip use CRC instead of a general hash algorithm?

Does a hash function output need to be bounded less than the number of buckets?

暂无

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

Related Question How does the HyperLogLog algorithm work? Why does Perlin noise use a hash function rather than computing random values? What algorithm does MongoDB use to hash ObjectIds? What hashing algorithm does memcached use to hash keys? Explanation on HyperLogLog algorithm Applying HyperLogLog to a sample of the population HyperLogLog Bucket size restriction Does ready two-way hash function for strings with fixed hash length exist in Java? Why does gzip use CRC instead of a general hash algorithm? Does a hash function output need to be bounded less than the number of buckets?

Related Tags

Which hash function does HyperLogLog use?

Question

1 answers

solution1 0 ACCPTED 2022-05-20 22:07:07

solution1
0 ACCPTED 2022-05-20 22:07:07