
Find Top-K Smallest Values So Far in Data Stream

Let's say that I have a data stream where a single data point is retrieved at a time:

import numpy as np
def next_data_point():
    """
    Mock a data stream. Data points will always be a positive float
    """
    return np.random.uniform(0, 1_000_000)  # uniform() already returns a float; it has no dtype argument

I need to be able to update a NumPy array and track the top-K smallest values seen so far from this stream (or until the user decides it is okay to stop the analysis via some check_stop_condition() function). Let's say we want to capture the 1,000 smallest values from the stream; a naive way to accomplish this might be:

k = 1000
topk = np.full(k, fill_value=np.inf, dtype='float')
while check_stop_condition():
    topk[:] = np.sort(np.append(topk, next_data_point()))[:k]

This works fine but is quite inefficient and can be slow if repeated millions of times, since we are:

  1. creating a new array every time
  2. sorting the concatenated array every time

So, I came up with a different approach to address these 2 inefficiencies:

k = 1000
topk = np.full(k, fill_value=np.inf)
while check_stop_condition():
    data_point = next_data_point()
    idx = np.searchsorted(topk, data_point)
    if idx < k:
        topk[idx + 1 :] = topk[idx:-1]  # shift the tail right by one to make room
        topk[idx] = data_point

Here, I leverage np.searchsorted() to replace np.sort() and to quickly find the insertion point, idx, for the next data point. I believe that np.searchsorted uses some sort of binary search and assumes that the array is already sorted. Then, we shift the data in topk to accommodate and insert the new data point if and only if idx < k.
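
For example, np.searchsorted() returns the index at which the new value would have to be inserted to keep the array sorted:

import numpy as np

a = np.array([1.0, 3.0, 5.0, np.inf])
np.searchsorted(a, 4.0)  # -> 2: insert before index 2 to keep `a` sorted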

I haven't seen this being done anywhere, so my question is: is there anything that can be done to make this even more efficient? Especially in the way that I'm shifting things around inside the if statement.

Sorting a huge array is very expensive, so it is not surprising that the second method is faster. However, the speed of the second method is probably bounded by the slow array copy. The complexity of the first method is O(n k log(k)), while the second method has a complexity of O(n (log(k) + k p)), with n the number of points and p the probability of taking the insertion branch.

To build a faster implementation, you can use a tree, more specifically a self-balancing binary search tree. Here is the algorithm:

topk = Tree()
maxi = np.inf
while check_stop_condition():             # n iterations
    data_point = next_data_point()
    if len(topk) < 1000:                  # O(1)
        topk.insert(data_point)           # O(log k)
    elif data_point < maxi:
        topk.insert(data_point)           # O(log k)
        topk.deleteMaxNode()              # O(log k)
        maxi = topk.findMaxValue()        # O(log k)
    # else: discard the value in O(1)
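
Python's standard library has no self-balancing binary search tree, but as a rough, runnable sketch of the above, the third-party sortedcontainers package (an assumption here: install it with pip install sortedcontainers) provides a SortedList that can play the role of the tree, reusing check_stop_condition() and next_data_point() from the question:

from sortedcontainers import SortedList

k = 1000
topk = SortedList()
maxi = float('inf')
while check_stop_condition():
    data_point = next_data_point()
    if len(topk) < k:
        topk.add(data_point)              # O(log k)
        if len(topk) == k:
            maxi = topk[-1]               # initial threshold: largest value kept
    elif data_point < maxi:
        topk.add(data_point)              # O(log k)
        topk.pop(-1)                      # drop the current maximum
        maxi = topk[-1]                   # new threshold
    # else: discard data_point in O(1)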

The above algorithm runs in O(n log k). One can show that this complexity is optimal (using only data_point comparisons).

In practice, binary heaps can be a bit faster (with the same complexity). Indeed, they have several advantages over self-balancing binary search trees in this case:

  • they can be implemented in a very compact way in memory (reducing cache misses and memory consumption)
  • insertion of the first k=1000 items can be done in O(k) time (via heapify) and very quickly

Note that discarded values are handled in constant time. This happens a lot on huge random datasets, since most values quickly become bigger than maxi. One can even prove that random datasets can be processed in O(n) time (which is optimal).

Note that Python 3 provides a standard heap implementation called heapq, which is probably a good starting point.
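
As a minimal sketch of that idea (again reusing check_stop_condition() and next_data_point() from the question; heapq only provides a min-heap, so values are negated to simulate a max-heap of the k smallest values seen so far):

import heapq

k = 1000
heap = []                                  # negated values; -heap[0] is the largest of the k smallest so far
while check_stop_condition():
    data_point = next_data_point()
    if len(heap) < k:
        heapq.heappush(heap, -data_point)      # O(log k)
    elif data_point < -heap[0]:
        heapq.heapreplace(heap, -data_point)   # pop the max and push in one O(log k) step
    # else: discard data_point in O(1)

topk = sorted(-x for x in heap)            # the k smallest values, ascending

Note that heapq.heapreplace pops the root and pushes the new item in a single sift, which is slightly cheaper than a separate heappop followed by heappush.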

