
Key-ordered dict in Python

I am looking for a solid implementation of an ordered associative array, that is, an ordered dictionary. I want the ordering in terms of keys, not of insertion order.

More precisely, I am looking for a space-efficient implementation of an int-to-float (or string-to-float for another use case) mapping structure for which:

  • Ordered iteration is O(n)
  • Random access is O(1)

The best I came up with was gluing a dict and a list of keys, keeping the latter ordered with bisect and insert.

Any better ideas?

"Random access O(1)" is an extremely exacting requirement which basically imposes an underlying hash table -- and I hope you do mean random READS only, because I think it can be mathematically proven than it's impossible in the general case to have O(1) writes as well as O(N) ordered iteration. “随机访问O(1)”是一个非常严格的要求,它基本上强加了一个底层哈希表 - 我希望你的意思只是随机READS,因为我认为它可以在数学上证明,而不是在一般情况下不可能有O (1)写入以及O(N)有序迭代。

I don't think you will find a pre-packaged container suited to your needs because they are so extreme -- O(log N) access would of course make all the difference in the world. To get the big-O behavior you want for reads and iterations you'll need to glue two data structures together, essentially a dict and a heap (or sorted list or tree), and keep them in sync. Although you don't specify, I think you'll only get amortized behavior of the kind you want - unless you're truly willing to pay any performance hit for inserts and deletes, which is the literal implication of the specs you express but does seem a pretty unlikely real-life requirement.

For O(1) read and amortized O(N) ordered iteration, just keep a list of all keys alongside the dict. E.g.:

class Crazy(object):
  def __init__(self):
    self.d = {}
    self.L = []
    self.sorted = True
  def __getitem__(self, k):
    return self.d[k]
  def __setitem__(self, k, v):
    if k not in self.d:
      self.L.append(k)
      self.sorted = False
    self.d[k] = v
  def __delitem__(self, k):
    del self.d[k]
    self.L.remove(k)
  def __iter__(self):
    if not self.sorted:
      self.L.sort()
      self.sorted = True
    return iter(self.L)

If you don't like the "amortized O(N) order" you can remove self.sorted and just repeat self.L.sort() in __setitem__ itself. That makes writes O(N log N), of course (while the approach above keeps writes at O(1)). Either approach is viable and it's hard to think of one as intrinsically superior to the other. If you tend to do a bunch of writes and then a bunch of iterations, the approach in the code above is best; if it's typically one write, one iteration, another write, another iteration, then it's just about a wash.
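For illustration, here is a minimal sketch of that eager variant (the CrazyEager name is mine, and it assumes the Crazy class defined above):

class CrazyEager(Crazy):
  def __setitem__(self, k, v):
    if k not in self.d:
      self.L.append(k)
      self.L.sort()   # keep L sorted on every new key; timsort makes this cheap
    self.d[k] = v     # self.sorted stays True, so the inherited __iter__ never re-sorts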

BTW, this takes shameless advantage of the unusual (and wonderful ;-) performance characteristics of Python's sort (aka "timsort"): among them, sorting a list that's mostly sorted but with a few extra items tacked on at the end is basically O(N) (if the tacked-on items are few enough compared to the sorted prefix part). I hear Java is gaining this sort soon, as Josh Bloch was so impressed by a tech talk on Python's sort that he started coding it for the JVM on his laptop then and there. Most systems (including, I believe, Jython as of today, and IronPython too) basically have sorting as an O(N log N) operation, not taking advantage of "mostly ordered" inputs; "natural mergesort", which Tim Peters fashioned into Python's timsort of today, is a wonder in this respect.
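A quick way to see this effect (my own sketch, not part of the answer): sort a long already-sorted run with a handful of keys tacked onto the end, and compare that against sorting a fully shuffled copy of the same data.

import random
import timeit

n = 10**6
mostly_sorted = list(range(n)) + random.sample(range(n, n + 100), 100)
shuffled = list(mostly_sorted)
random.shuffle(shuffled)

print(timeit.timeit(lambda: sorted(mostly_sorted), number=10))  # roughly linear
print(timeit.timeit(lambda: sorted(shuffled), number=10))       # full O(N log N)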

The sortedcontainers module provides a SortedDict type that meets your requirements. It basically glues a SortedList and dict type together. The dict provides O(1) lookup and the SortedList provides O(N) iteration (it's extremely fast). The whole module is pure-Python and has benchmark graphs to back up the performance claims (fast-as-C implementations). SortedDict is also fully tested with 100% coverage and hours of stress testing. It's compatible with Python 2.6 through 3.4.
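A rough usage sketch, assuming the package is installed (pip install sortedcontainers); SortedDict exposes the normal dict interface while iterating in key order:

from sortedcontainers import SortedDict

sd = SortedDict()
sd[3] = 3.3
sd[1] = 1.0
sd[2] = 2.2
print(list(sd))           # [1, 2, 3] -- keys in key order, not insertion order
print(sd[2])              # 2.2 -- lookup goes through the underlying dict
print(list(sd.items()))   # [(1, 1.0), (2, 2.2), (3, 3.3)]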

Here is my own implementation:

import bisect
class KeyOrderedDict(object):
    __slots__ = ['d', 'l']

    def __init__(self, *args, **kwargs):
        self.l = sorted(kwargs)
        self.d = kwargs

    def __setitem__(self, k, v):
        if k not in self.d:
            idx = bisect.bisect(self.l, k)
            self.l.insert(idx, k)
        self.d[k] = v

    def __getitem__(self, k):
        return self.d[k]

    def __delitem__(self, k):
        idx = bisect.bisect_left(self.l, k)
        del self.l[idx]
        del self.d[k]

    def __iter__(self):
        return iter(self.l)

    def __contains__(self, k):
        return k in self.d

The use of bisect keeps self.l ordered, and insertion is O(n) (because of the insert, but that's not a killer in my case, because I append far more often than I truly insert, so the usual case is amortized O(1)). Access is O(1), and iteration O(n). But maybe someone has invented (in C) something with a more clever structure?
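For completeness, a small usage sketch of the KeyOrderedDict above (not part of the original post):

kod = KeyOrderedDict()
kod[3] = 3.3
kod[1] = 1.0
kod[2] = 2.2
print(list(kod))   # [1, 2, 3] -- iteration follows key order
print(kod[2])      # 2.2 -- plain dict lookup, O(1)
del kod[1]
print(list(kod))   # [2, 3]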

An ordered tree is usually better for these cases, but random access is going to be log(n). You should also take insertion and removal costs into account...

The ordereddict package ( http://anthon.home.xs4all.nl/Python/ordereddict/ ) that I implemented back in 2007 includes sorteddict. sorteddict is a KSO (Key Sorted Order) dictionary. It is implemented in C, is very space efficient, and is several times faster than a pure Python implementation. The downside is that it only works with CPython.

>>> from _ordereddict import sorteddict
>>> x = sorteddict()
>>> x[1] = 1.0
>>> x[3] = 3.3
>>> x[2] = 2.2
>>> print x
sorteddict([(1, 1.0), (2, 2.2), (3, 3.3)])
>>> for i in x:
...    print i, x[i]
... 
1 1.0
2 2.2
3 3.3
>>> 

Sorry for the late reply; maybe this answer can help others find that library.

You could build a dict that allows traversal by storing a pair (value, next_key) in each position.

Random access:

my_dict[k][0]   # for a key k

Traversal:

k = start_key   # stored somewhere
while k is not None:     # next_key is None at the end of the list
    v, k = my_dict[k]
    yield v

Keep a pointer to start and end and you'll have an efficient update for those cases where you just need to add onto the end of the list.

Inserting in the middle is obviously O(n). Possibly you could build a skip list on top of it if you need more speed.
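A minimal sketch of the whole idea (the LinkedDict name and method names are mine, not from the answer): each entry stores (value, next_key), and start/end pointers make appending at the tail O(1).

class LinkedDict(object):
    def __init__(self):
        self.d = {}
        self.start = None   # first key in order
        self.end = None     # last key in order

    def append(self, k, v):
        # assumes k is greater than every existing key, i.e. we add at the end
        self.d[k] = (v, None)
        if self.start is None:
            self.start = k
        else:
            prev_v, _ = self.d[self.end]
            self.d[self.end] = (prev_v, k)   # re-link the old tail to k
        self.end = k

    def __getitem__(self, k):
        return self.d[k][0]                  # O(1) random access

    def itervalues(self):
        k = self.start
        while k is not None:                 # next_key is None at the end
            v, k = self.d[k]
            yield v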

I am not sure which Python version you are using, but if you want to experiment, Python 3.1 includes an official implementation of ordered dictionaries: http://www.python.org/dev/peps/pep-0372/ and http://docs.python.org/3.1/whatsnew/3.1.html#pep-372-ordered-dictionaries
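Note that collections.OrderedDict (PEP 372) remembers insertion order, not key order, so for this question you would have to feed it already-sorted items. A minimal sketch (my note, not the answer's):

from collections import OrderedDict

d = {3: 3.3, 1: 1.0, 2: 2.2}
od = OrderedDict(sorted(d.items()))   # sort once; iteration then follows key order
print(list(od))                       # [1, 2, 3]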

Here's a pastie: I had a need for something similar. Note however that this specific implementation is immutable; there are no inserts once the instance is created. The exact performance doesn't quite match what you're asking for, however: lookup is O(log n) and a full scan is O(n). It works by using the bisect module on a tuple of key/value (tuple) pairs. Even if you can't use this precisely, you might have some success adapting it to your needs.

import bisect

class dictuple(object):
    """
        >>> h0 = dictuple()
        >>> h1 = dictuple({"apples": 1, "bananas":2})
        >>> h2 = dictuple({"bananas": 3, "mangoes": 5})
        >>> h1+h2
        ('apples':1, 'bananas':3, 'mangoes':5)
        >>> h1 > h2
        False
        >>> h1 > 6
        False
        >>> 'apples' in h1
        True
        >>> 'apples' in h2
        False
        >>> d1 = {}
        >>> d1[h1] = "salad"
        >>> d1[h1]
        'salad'
        >>> d1[h2]
        Traceback (most recent call last):
        ...
        KeyError: ('bananas':3, 'mangoes':5)
   """


    def __new__(cls, *args, **kwargs):
        initial = {}
        args = [] if args is None else args
        for arg in args:
            initial.update(arg)
        initial.update(kwargs)

        instance = object.__new__(cls)
        instance.__items = tuple(sorted(initial.items(),key=lambda i:i[0]))
        return instance

    def __init__(self,*args, **kwargs):
        pass

    def __find(self,key):
        return bisect.bisect(self.__items, (key,))


    def __getitem__(self, key):
        ind = self.__find(key)
        # guard the index: a key greater than all stored keys bisects to len(items)
        if ind < len(self.__items) and self.__items[ind][0] == key:
            return self.__items[ind][1]
        raise KeyError(key)
    def __repr__(self):
        return "({0})".format(", ".join(
                        "{0}:{1}".format(repr(item[0]),repr(item[1]))
                          for item in self.__items))
    def __contains__(self, key):
        ind = self.__find(key)
        return ind < len(self.__items) and self.__items[ind][0] == key
    def __cmp__(self,other):

        return cmp(self.__class__.__name__, other.__class__.__name__
                  ) or cmp(self.__items, other.__items)
    def __eq__(self,other):
        return self.__items == other.__items
    def __format__(self,key):
        pass
    #def __ge__(self,key):
    #    pass
    #def __getattribute__(self,key):
    #    pass
    #def __gt__(self,key):
    #    pass
    __seed = hash("dictuple")
    def __hash__(self):
        return dictuple.__seed^hash(self.__items)
    def __iter__(self):
        return self.iterkeys()
    def __len__(self):
        return len(self.__items)
    #def __reduce__(self,key):
    #    pass
    #def __reduce_ex__(self,key):
    #    pass
    #def __sizeof__(self,key):
    #    pass

    @classmethod
    def fromkeys(cls, key, v=None):
        return cls(dict.fromkeys(key, v))

    def get(self, key, default=None):
        ind = self.__find(key)
        if ind < len(self.__items) and self.__items[ind][0] == key:
            return self.__items[ind][1]
        return default

    def has_key(self, key):
        ind = self.__find(key)
        return ind < len(self.__items) and self.__items[ind][0] == key

    def items(self):
        return list(self.iteritems())

    def iteritems(self):
        return iter(self.__items)

    def iterkeys(self):
        return (i[0] for i in self.__items)

    def itervalues(self):
        return (i[1] for i in self.__items)

    def keys(self):
        return list(self.iterkeys())

    def values(self):
        return list(self.itervalues())
    def __add__(self, other):
        _sum = dict(self.__items)
        _sum.update(other.__items)
        return self.__class__(_sum)

if __name__ == "__main__":
    import doctest
    doctest.testmod()

For "string to float" problem you can use a Trie - it provides O(1) access time and O(n) sorted iteration. 对于“字符串到浮动”问题,您可以使用Trie - 它提供O(1)访问时间和O(n)排序迭代。 By "sorted" I mean "sorted alphabetically by key" - it seems that the question implies the same. 通过“排序”我的意思是“按键按字母顺序排序” - 似乎问题意味着相同。

Some implementations (each with its own strong and weak points):

Here's one option that has not been mentioned in other answers, I think:

  • Use a binary search tree (Treap / AVL / RB) to keep the mapping.
  • Also use a hashmap (aka dictionary) to keep the same mapping (again).

This will provide O(n) ordered traversal (via the tree), O(1) random access (via the hashmap) and O(log n) insertion/deletion (because you need to update both the tree and the hash).

The drawback is the need to keep all the data twice; however, the alternatives which suggest keeping a list of keys alongside a hashmap are not much better in this sense.
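A rough sketch of the double bookkeeping (my code; it uses a plain unbalanced BST for brevity, where a real implementation would use an AVL/red-black tree or treap to actually guarantee O(log n) updates):

class _Node(object):
    __slots__ = ('key', 'left', 'right')
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

class TreeBackedDict(object):
    def __init__(self):
        self.d = {}        # hashmap: O(1) random access
        self.root = None   # search tree: ordered traversal

    def __setitem__(self, k, v):
        if k not in self.d:
            self.root = self._insert(self.root, k)
        self.d[k] = v

    def _insert(self, node, k):
        if node is None:
            return _Node(k)
        if k < node.key:
            node.left = self._insert(node.left, k)
        else:
            node.right = self._insert(node.right, k)
        return node

    def __getitem__(self, k):
        return self.d[k]

    def __iter__(self):
        # iterative in-order traversal: keys come out in sorted order
        stack, node = [], self.root
        while stack or node is not None:
            while node is not None:
                stack.append(node)
                node = node.left
            node = stack.pop()
            yield node.key
            node = node.right

Deletion is omitted here for brevity; it needs both a tree delete and a dict delete to keep the two views consistent.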
