
Using HyperLogLog functions in BigQuery, can you get different results from the same query on the same data?

My query looks like:

SELECT
    HLL_COUNT.MERGE((SELECT HLL_COUNT.INIT(key.item) FROM UNNEST(data.list) key)),
FROM dataset

Let's say I run this query 10000 times on the same set of data. Will I get 10000 identical results, or might I get slightly different outputs a small percentage of the time?

I have not found an explanation of this in the documentation, and I would like to understand it without having to run my query thousands of times ;)

I would say that you get identical results with very high probability. However, without knowing the implementation in detail, you can't say that with 100% certainty.

BigQuery uses the HLL++ algorithm as described in the paper "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm". There, a sparse representation is used for small cardinalities. The problem is that neither the maintenance of the sparse representation nor the criteria for switching to the dense representation are well defined, so the implementation may depend on the order in which the data is processed. If the processing and merge orders are not fixed, the same data might end up in the sparse representation in one run and in the dense representation in another, which would result in slightly different estimates.
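To illustrate why the dense representation on its own should not be the source of any variation, here is a minimal sketch of a toy dense-only HyperLogLog in Python. It is not BigQuery's implementation: the precision `P`, the SHA-256 based hash, and the helper names `init_sketch`, `merge` and `merge_all` are all assumptions made for the example. Because a dense merge is just a register-wise maximum, which is commutative and associative, the merged registers come out identical regardless of merge order. Any order dependence would therefore have to come from the sparse representation or from the sparse-to-dense switch, which this toy deliberately leaves out.

```python
import hashlib
import random

P = 10          # toy precision: 2**P registers (an assumption, not BigQuery's setting)
M = 1 << P

def _hash64(value: str) -> int:
    # Deterministic 64-bit hash taken from SHA-256 (stand-in for the real hash function).
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

def init_sketch(values):
    # Build a dense sketch: one register per bucket, holding the maximum rank seen.
    regs = [0] * M
    for v in values:
        h = _hash64(v)
        idx = h >> (64 - P)                       # top P bits select the register
        rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
        rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1-bit
        regs[idx] = max(regs[idx], rank)
    return regs

def merge(a, b):
    # Dense merge is a register-wise maximum, so it is commutative and associative.
    return [max(x, y) for x, y in zip(a, b)]

def merge_all(sketches):
    out = [0] * M
    for s in sketches:
        out = merge(out, s)
    return out

# One sketch per "row", loosely mimicking HLL_COUNT.INIT over each row's list.
rows = [[f"item-{i}-{j}" for j in range(20)] for i in range(50)]
sketches = [init_sketch(r) for r in rows]
reference = merge_all(sketches)

# Merge the same sketches in 20 different random orders.
rng = random.Random(0)
for _ in range(20):
    shuffled = sketches[:]
    rng.shuffle(shuffled)
    assert merge_all(shuffled) == reference
```

Since the final registers are identical in every order, any estimate derived from them is identical too. This only covers the dense case; it says nothing about how a real implementation handles the sparse representation.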
