Efficient way to find the min and max value of a changing set in Python

I need to find the min/max value in a large set that keeps changing. In C++, it could be:

#include <cstdio>
#include <set>
using namespace std;
int minVal(set<int> & mySet){
    return *mySet.begin();
}
int maxVal(set<int> & mySet){
    return *mySet.rbegin();
}
int main(){
    set <int> mySet;
    for(..;..;..){
       // add or delete element in mySet
       ...
       // print the min and max value in the set
       printf("%d %d\n", minVal(mySet), maxVal(mySet)); 
    }
}

In C++, each query operation is O(1), but in Python I tried the built-in min and max functions and they are too slow: each min/max call takes O(n) time (n is the size of my set). Is there an elegant and efficient way to do this? Or is there a data type that supports these operations?

mySet=set()
for i in range(..):
  # add or delete element in mySet
  ...
  # print the min and max value in the set
  print(min(mySet),max(mySet))

In terms of complexity, an efficient implementation wraps a Python set (which uses a hash table), keeps a pair of maxElement and minElement attributes on the object, and updates them when elements are added or removed. This keeps every membership, min, and max query O(1). With the simplest implementation, though, deletion is O(n) in the worst case: if you happen to remove the minimum element you have to find the next-smallest one, and the same goes for the maximum.
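A minimal sketch of that wrapper, assuming the elements are mutually comparable. The class name TrackedSet and the attribute names min_element and max_element are made up for illustration; they stand in for the minElement and maxElement attributes described above.

```python
class TrackedSet:
    """Illustrative wrapper around set that caches its min and max."""

    def __init__(self):
        self._items = set()
        self.min_element = None  # None means the set is empty
        self.max_element = None

    def __contains__(self, val):
        return val in self._items  # O(1) average, hash lookup

    def __len__(self):
        return len(self._items)

    def add(self, val):
        self._items.add(val)
        # O(1): only compare against the cached extremes
        if self.min_element is None or val < self.min_element:
            self.min_element = val
        if self.max_element is None or val > self.max_element:
            self.max_element = val

    def discard(self, val):
        if val not in self._items:
            return
        self._items.discard(val)
        if not self._items:
            self.min_element = self.max_element = None
            return
        # O(n) rescan, but only when a cached extreme was removed
        if val == self.min_element:
            self.min_element = min(self._items)
        if val == self.max_element:
            self.max_element = max(self._items)
```

Reading s.min_element and s.max_element is then a plain attribute access. The rescan in discard is the O(n) worst case mentioned above.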

That said, the C++ implementation uses a balanced search tree, which has O(log n) membership checks, deletions, and insertions. You can find an implementation of this type of data structure in the bintrees package.

I wouldn't use just a heapq as suggested in the comments, because checking whether an element exists in a heap is O(n), and membership checks are the main point of a set data structure, which I assume you need.

You could use two priority queues to maintain the min and max values of the set, respectively. Unfortunately, the stdlib's heapq doesn't support removing entries from the queue in O(log n) time out of the box. The suggested workaround is to just mark entries as removed and discard them when you pop them from the queue, which is acceptable in many scenarios. Below is a Python class implementing that approach:

from heapq import heappop, heappush

class MinMaxSet:
    def __init__(self):
        self.min_queue = []
        self.max_queue = []
        self.entries = {}  # mapping of values to entries in the queue

    def __len__(self):
        return len(self.entries)

    def add(self, val):
        if val not in self.entries:
            entry_min = [val, False]
            entry_max = [-val, False]

            heappush(self.min_queue, entry_min)
            heappush(self.max_queue, entry_max)

            self.entries[val] = entry_min, entry_max

    def delete(self, val):
        if val in self.entries:
            entry_min, entry_max = self.entries.pop(val)
            entry_min[-1] = entry_max[-1] = True  # deleted

    def get_min(self):
        while self.min_queue[0][-1]:
            heappop(self.min_queue)
        return self.min_queue[0][0]

    def get_max(self):
        while self.max_queue[0][-1]:
            heappop(self.max_queue)
        return -self.max_queue[0][0]

Demo:

>>> s = MinMaxSet()
>>> for x in [1, 5, 10, 14, 11, 14, 15, 2]:
...     s.add(x)
... 
>>> len(s)
7
>>> print(s.get_min(), s.get_max())
1 15
>>> s.delete(1)
>>> s.delete(15)
>>> print(s.get_min(), s.get_max())
2 14

As of 2020, the bintrees package is deprecated and should be replaced with sortedcontainers.

Example usage:

import sortedcontainers

s = sortedcontainers.SortedList()
s.add(10)
s.add(3)
s.add(25)
s.add(8)
# names chosen so the built-in min() and max() are not shadowed
smallest = s[0]      # read min value
smallest = s.pop(0)  # read and remove min value
largest = s[-1]      # read max value
largest = s.pop()    # read and remove max value

Besides SortedList, you also have SortedDict and SortedSet. See the API documentation.
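If adding a third-party package is not an option, the stdlib bisect module can keep a plain list sorted. This is a different and simpler technique than what sortedcontainers does: binary search finds the position in O(log n), but inserting into or deleting from the list still shifts elements, so updates are O(n), while reading the min and max is O(1). The class name SortedUniqueList below is made up for this sketch.

```python
from bisect import bisect_left


class SortedUniqueList:
    """Illustrative set-like container backed by a sorted list."""

    def __init__(self):
        self._data = []

    def __len__(self):
        return len(self._data)

    def __contains__(self, val):
        i = bisect_left(self._data, val)  # O(log n) binary search
        return i < len(self._data) and self._data[i] == val

    def add(self, val):
        i = bisect_left(self._data, val)
        # skip duplicates so the container behaves like a set
        if i == len(self._data) or self._data[i] != val:
            self._data.insert(i, val)  # O(n) shift

    def discard(self, val):
        i = bisect_left(self._data, val)
        if i < len(self._data) and self._data[i] == val:
            del self._data[i]  # O(n) shift

    def min(self):
        return self._data[0]   # O(1)

    def max(self):
        return self._data[-1]  # O(1)
```

For the sizes in the question ("large" and changing often), the O(n) updates may still be the bottleneck, which is where sortedcontainers is the better fit.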

In this test, NumPy's min and max are nearly twice as fast as the built-in functions:

import time as t
import numpy as np

def initialize():
    storage.reset()

def tick():

    array = data.btc_usd.period(250, 'close')

    t1 = t.time()

    a = min(array)
    b = max(array)

    t2 = t.time()

    c = np.min(array)
    d = np.max(array)

    t3 = t.time()

    storage.t1 = storage.get('t1', 0)
    storage.t2 = storage.get('t2', 0)
    storage.t1 += t2-t1
    storage.t2 += t3-t2


def stop():

    log('python: %.5f' % storage.t1)
    log('numpy: %.5f' % storage.t2)
    log('ticks: %s' % info.tick)

yields:

[2015-11-06 10:00:00] python: 0.45959
[2015-11-06 10:00:00] numpy: 0.26148
[2015-11-06 10:00:00] ticks: 7426

But I think you're looking for something more like this:

import time as t
import numpy as np

def initialize():
    storage.reset()

def tick():

    storage.closes = storage.get('closes', [])
    if info.tick == 0:
        storage.closes = [float(x) for x in data.btc_usd.period(250, 'close')]
    else:
        z = storage.closes.pop(0) #pop left
        price = float(data.btc_usd.close)
        storage.closes.append(price) #append right
    array = np.array(storage.closes)[-250:]

    # now we know 'z' just left the list and 'price' just entered
    # otherwise the array is the same as the previous example

    t1 = t.time()
    # PYTHON METHOD
    a = min(array)
    b = max(array)

    t2 = t.time()
    # NUMPY METHOD
    c = np.min(array)
    d = np.max(array)

    t3 = t.time()
    # STORAGE METHOD
    storage.e = storage.get('e', 0)
    storage.f = storage.get('f', 0)
    if info.tick == 0:
        storage.e = np.min(array)
        storage.f = np.max(array)
    else:
        if z == storage.e:
            storage.e = np.min(array)
        if z == storage.f:
            storage.f = np.max(array)
        if price < storage.e:
            storage.e = price
        if price > storage.f:
            storage.f = price

    t4 = t.time()

    storage.t1 = storage.get('t1', 0)
    storage.t2 = storage.get('t2', 0)
    storage.t3 = storage.get('t3', 0)    
    storage.t1 += t2-t1
    storage.t2 += t3-t2
    storage.t3 += t4-t3


def stop():

    log('python: %.5f'  % storage.t1)
    log('numpy: %.5f'   % storage.t2)
    log('storage: %.5f' % storage.t3)
    log('ticks: %s'     % info.tick)

yields:

[2015-11-06 10:00:00] python: 0.45694
[2015-11-06 10:00:00] numpy: 0.23580
[2015-11-06 10:00:00] storage: 0.16870
[2015-11-06 10:00:00] ticks: 7426

which brings us down to about one third of the built-in method's time, over roughly 7,500 iterations against a list of 250.
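For this specific pattern, where one value leaves on the left and one enters on the right each tick, the occasional full rescan can be avoided. The sketch below is not the method timed above; it swaps in a different technique, a pair of monotonic deques, which gives amortized O(1) per update. The class name SlidingMinMax and its push method are made up for illustration, and it uses only the stdlib.

```python
from collections import deque


class SlidingMinMax:
    """Illustrative min/max tracker over the last `size` pushed values."""

    def __init__(self, size):
        self.size = size
        self.count = 0       # how many values have been pushed so far
        self._min = deque()  # (index, value) pairs, values increasing
        self._max = deque()  # (index, value) pairs, values decreasing

    def push(self, value):
        i = self.count
        self.count += 1
        # Older values that are >= the new one can never be the min again.
        while self._min and self._min[-1][1] >= value:
            self._min.pop()
        self._min.append((i, value))
        # Older values that are <= the new one can never be the max again.
        while self._max and self._max[-1][1] <= value:
            self._max.pop()
        self._max.append((i, value))
        # Drop candidates that have slid out of the window.
        oldest = i - self.size + 1
        while self._min[0][0] < oldest:
            self._min.popleft()
        while self._max[0][0] < oldest:
            self._max.popleft()
        return self._min[0][1], self._max[0][1]
```

Each call to push returns the (min, max) of the current window, so with a window of 250 it would be called once per tick with the new price. Every value is appended and removed at most once per deque, which is where the amortized O(1) bound comes from.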
