
An efficient quantiles algorithm/data structure that allows samples to be updated as they increment over time?

I'm looking for an efficient quantiles algorithm that allows sample values to be "upserted" or replaced as the value changes over time.

Let's say I have values for items 1 to n. I'd like to put these into a quantiles algorithm that would store them efficiently. But then say at some point in the future, the value for item i gets incremented. I'd like to remove the original value for item i and replace it with the updated value. The specific use case is a streaming system where the sample values are incrementing over time.

The closest thing I've seen to this is the t-Digest data structure. It stores sample values efficiently. The only thing it lacks is the ability to remove and replace a sample value.

I've also looked at the Apache DataSketches quantiles sketch. It suffers from the same problem: there is no way to remove and replace a sample.

edit: thinking about this more, there wouldn't necessarily need to be a remove of the old value and an insertion of the incremented value. There might be a way to recalculate internal state more easily if there's a constraint that values can only be updated.

If O(log n) update time and O(log n) quantile computation time are acceptable for you, then one solution is to implement any kind of self-balancing binary search tree (splay tree, AVL tree, red-black tree) while keeping a HashMap<Key, Node> alongside the tree (or, if you know that your keys are e.g. the numbers 0 to n-1, you can just use an array for the same purpose). Each node also needs to store the number of nodes in its subtree. This is possible with all of the trees mentioned: it is a small addition to every method that modifies nodes, such as rotations.

Pseudocode for updating the value of key K to a new value V would be:

node = node_by_key[K]                                             # O(1) hash map lookup
delete_node_keeping_subtree_counts_valid(node)                    # O(log n)
node_by_key[K] = add_new_node_keeping_subtree_counts_valid(K, V)  # O(log n), and re-register the new node

Getting quantile q is possible in O(log n) too, because the subtree size stored in each node gives you access to the i-th smallest element in O(log n) time (for quantile q you would pick a rank such as i = floor(q * (n - 1)), 0-based). Pseudocode for that operation would look like:

# i-th smallest element requested (0-based, assumes 0 <= i < root.nodes_count)
def select(root, i):
    node = root
    while True:
        left = node.left
        left_count = left.nodes_count if left is not None else 0
        if i < left_count:
            node = left  # the i-th element is in the left subtree
        elif i == left_count:
            return node.value  # exactly i elements are smaller, so the i-th value is in the current node
        else:
            i -= left_count + 1  # skip the left subtree and the current node
            node = node.right  # select element i - left_count - 1 from the right subtree

I'm not aware of a good open-source Java implementation of this data structure, but writing your own AVL tree is not that difficult. A splay tree should be the easiest to write; its O(log n) bound is only amortized rather than worst-case per operation, but on average it should perform well.
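For what it's worth, here is a minimal Python sketch of the structure described above. Several choices are my assumptions, not something the answer prescribes. It uses a treap (a randomized balanced BST) rather than an AVL or red-black tree because it is shorter to write, so the O(log n) bounds are expected rather than worst-case. Nodes are ordered by (value, key) tuples, which lets a plain dict from key to current value stand in for the HashMap<Key, Node>. Quantile q is mapped to index floor(q * (n - 1)). The names UpdatableQuantiles, update, select and quantile are hypothetical.

```python
import random


class _Node:
    __slots__ = ("item", "prio", "left", "right", "count")

    def __init__(self, item):
        self.item = item             # (value, key); the key breaks ties between equal values
        self.prio = random.random()  # random priority keeps the tree balanced in expectation
        self.left = self.right = None
        self.count = 1               # number of nodes in the subtree rooted here


def _count(node):
    return node.count if node is not None else 0


def _fix(node):
    """Recompute the subtree count after a child pointer changed."""
    node.count = 1 + _count(node.left) + _count(node.right)
    return node


def _split(node, item):
    """Split into (nodes with smaller items, nodes with items >= item)."""
    if node is None:
        return None, None
    if node.item < item:
        smaller, rest = _split(node.right, item)
        node.right = smaller
        return _fix(node), rest
    smaller, rest = _split(node.left, item)
    node.left = rest
    return smaller, _fix(node)


def _merge(a, b):
    """Merge two treaps; every item in a must sort before every item in b."""
    if a is None or b is None:
        return a if b is None else b
    if a.prio > b.prio:
        a.right = _merge(a.right, b)
        return _fix(a)
    b.left = _merge(a, b.left)
    return _fix(b)


def _delete(node, item):
    """Remove the node holding item, which must be present."""
    if item == node.item:
        return _merge(node.left, node.right)
    if item < node.item:
        node.left = _delete(node.left, item)
    else:
        node.right = _delete(node.right, item)
    return _fix(node)


class UpdatableQuantiles:
    def __init__(self):
        self._root = None
        self._value_by_key = {}  # stands in for the HashMap<Key, Node>

    def __len__(self):
        return _count(self._root)

    def update(self, key, value):
        """Insert key, or replace its previous value (expected O(log n))."""
        if key in self._value_by_key:
            self._root = _delete(self._root, (self._value_by_key[key], key))
        self._value_by_key[key] = value
        smaller, rest = _split(self._root, (value, key))
        self._root = _merge(_merge(smaller, _Node((value, key))), rest)

    def select(self, i):
        """Return the i-th smallest value, 0-based (expected O(log n))."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        node = self._root
        while True:
            left_count = _count(node.left)
            if i < left_count:
                node = node.left
            elif i == left_count:
                return node.item[0]
            else:
                i -= left_count + 1
                node = node.right

    def quantile(self, q):
        """Assumed convention: q in [0, 1] maps to index floor(q * (n - 1))."""
        return self.select(int(q * (len(self) - 1)))
```

Calling update with an existing key replaces that key's old sample, so the number of stored samples stays equal to the number of distinct keys. Keys need to be mutually comparable, because they break ties between equal values.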

We can keep a Map from variable name to value and a SortedMap (a search tree) whose keys are composed from the value and the name (such as value + "_" + name, or a Comparable object with these two fields). The sorted keys then follow the order of the values, yet every key is unique, so we can remove the old value + variable name and insert the new value + variable name. The Comparable form is the safer of the two, since string keys only sort in numeric order if the values are zero-padded. This is a technique used in HBase, which is not very different from a persistent TreeMap (a self-balancing binary search tree).

Computing quantiles or percentiles is then a matter of scanning the structure in key order until the desired rank is reached.

This is efficient when updates are frequent and quantile queries are rare.
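A rough Python sketch of the Map plus SortedMap idea, under some assumptions of mine: Python has no built-in SortedMap, so a sorted list maintained with the standard bisect module stands in for it, and the composite key is a (value, name) tuple rather than a concatenated string. The class and method names are hypothetical, and the quantile convention (index floor(q * (n - 1))) is also an assumption.

```python
import bisect


class SortedSamples:
    def __init__(self):
        self._value_by_name = {}  # the Map: name -> current value
        self._sorted = []         # stand-in for the SortedMap: (value, name) tuples in order

    def upsert(self, name, value):
        """Insert name, or replace its old (value, name) entry with the new one."""
        if name in self._value_by_name:
            old_item = (self._value_by_name[name], name)
            # The entry is present and unique, so bisect_left lands exactly on it.
            del self._sorted[bisect.bisect_left(self._sorted, old_item)]
        self._value_by_name[name] = value
        bisect.insort(self._sorted, (value, name))

    def quantile(self, q):
        """Assumed convention: q in [0, 1] maps to index floor(q * (n - 1))."""
        if not self._sorted:
            raise ValueError("no samples")
        return self._sorted[int(q * (len(self._sorted) - 1))][0]


samples = SortedSamples()
samples.upsert("a", 9)
samples.upsert("b", 10)
samples.upsert("c", 1)
samples.upsert("c", 20)  # replaces the old sample for "c"
print(samples.quantile(0.5))
```

The cost profile differs from a Java TreeMap: a Python list gives direct indexing for the quantile lookup, but each insertion or deletion shifts elements and is O(n), whereas a tree-based SortedMap updates in O(log n) and needs a scan to reach a rank.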

When quantile queries are not that rare, I do not have any good ideas. Perhaps one could also keep a set of heap structures, indexed in a way that makes removal more efficient, as described here: https://stackoverflow.com/questions/8705099/how-to-delete-in-a-heap-data-structure#:~:text=4%20Answers&text=Actually%2C%20you%20can%20remove%20an,parent%20of%20the%20old%20item
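To make the "indexed heap" idea a bit more concrete, here is a small Python sketch of a binary min-heap that also keeps a key-to-position index, so an arbitrary item can be removed in O(log n). It shows only that building block. How several such heaps would be combined to answer quantile queries is left open in the answer above, and this sketch does not attempt it. The class name IndexedMinHeap and its methods are hypothetical.

```python
class IndexedMinHeap:
    """Binary min-heap of (value, key) pairs with a key -> position index."""

    def __init__(self):
        self._heap = []  # (value, key) pairs in heap order
        self._pos = {}   # key -> index of that key's pair in self._heap

    def __len__(self):
        return len(self._heap)

    def peek(self):
        return self._heap[0]

    def _swap(self, i, j):
        heap = self._heap
        heap[i], heap[j] = heap[j], heap[i]
        self._pos[heap[i][1]] = i
        self._pos[heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self._heap[i] >= self._heap[parent]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self._heap)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self._heap[child] < self._heap[smallest]:
                    smallest = child
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest

    def push(self, key, value):
        if key in self._pos:
            raise KeyError(f"{key!r} is already in the heap")
        self._heap.append((value, key))
        self._pos[key] = len(self._heap) - 1
        self._sift_up(len(self._heap) - 1)

    def remove(self, key):
        """Remove the pair stored for key and return its value."""
        i = self._pos.pop(key)
        removed = self._heap[i]
        last = self._heap.pop()
        if i < len(self._heap):  # the removed pair was not in the last slot
            self._heap[i] = last
            self._pos[last[1]] = i
            self._sift_up(i)     # the moved pair may belong higher up...
            self._sift_down(i)   # ...or lower down; at most one of these moves it
        return removed[0]

    def update(self, key, value):
        """Replace the value stored for key: indexed removal, then a fresh push."""
        self.remove(key)
        self.push(key, value)
```

The position index is what avoids the linear search a plain heap would need to find the item being replaced.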

