
Python: update a list of tuples… fastest method

This question is related to another question asked here: Sorting 1M records

I have since figured out the problem I was having with sorting. I was sorting items from a dictionary into a list every time I updated the data. I have since realized that much of the power of Python's sort comes from the fact that it sorts data that is already partially sorted more quickly.

So, here is the question. Suppose I have the following as a sample set:

self.sorted_records = [(1, 1234567890), (20, 1245678903), 
                       (40, 1256789034), (70, 1278903456)]

t[1] of each tuple in the list is a unique id. Now I want to update this list with the following:

updated_records = {1245678903:45, 1278903456:76}

What is the fastest way for me to do so, ending up with

self.sorted_records = [(1, 1234567890), (45, 1245678903),
                       (40, 1256789034), (76, 1278903456)]

Currently I am doing something like this:

updated_keys = updated_records.keys()
for i, record in enumerate(self.sorted_data):
    if record[1] in updated_keys:
        updated_keys.remove(record[1])
        self.sorted_data[i] = (updated_records[record[1]], record[1])

But I am sure there is a faster, more elegant solution out there.

Any help?

* edit * It turns out I used bad examples for the ids, since they end up in sorted order when I do my update. I am actually interested in t[0] being in sorted order. After the update I was intending to re-sort with the updated data, but it looks like bisect might be the ticket to insert in sorted order. * end edit *

You're scanning through all n records. You could instead do a binary search, which would be O(log(n)) per lookup instead of O(n). You can use the bisect module to do this.
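A minimal sketch of that binary search, written for Python 3. It assumes something the question does not do: that the records are stored as (recid, value) and kept sorted by recid, with a parallel list of ids to search. The names records, ids and replace_value are made up for illustration.

```python
import bisect

# Assumption: records are (recid, value) tuples kept sorted by recid,
# i.e. the reverse of the (value, recid) layout used in the question.
records = [(1234567890, 1), (1245678903, 20),
           (1256789034, 40), (1278903456, 70)]
ids = [recid for recid, _ in records]  # parallel list of ids, already sorted


def replace_value(recid, new_value):
    """Binary-search for recid and overwrite its value; return True if found."""
    i = bisect.bisect_left(ids, recid)
    if i < len(ids) and ids[i] == recid:
        records[i] = (recid, new_value)
        return True
    return False


for recid, value in {1245678903: 45, 1278903456: 76}.items():
    replace_value(recid, value)
```

This only helps when the list is ordered by the field you search on. If the list has to stay sorted by value, a lookup by id alone cannot use bisect directly.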

Since apparently you don't care about the ending value of self.sorted_records actually being sorted (your values end up in the order 1, 45, 40, 76, which is NOT sorted!), AND you only appear to care about IDs in updated_records that are also in self.sorted_data, a listcomp (with side effects, if you want to change updated_records on the fly) would serve you well, i.e.:

self.sorted_data = [(updated_records.pop(recid, value), recid) 
                    for (value, recid) in self.sorted_data]

The .pop call removes from updated_records the keys (and corresponding values) that end up in the new self.sorted_data. The previous value for that recid, value, is supplied as the second argument to pop so that nothing changes when a recid is NOT in updated_records. This leaves the "new" stuff in updated_records, so you can e.g. append it to self.sorted_data before re-sorting, i.e. I suspect you want to continue with something like

self.sorted_data.extend((value, recid)
                        for recid, value in updated_records.iteritems())
self.sorted_data.sort()

though this part DOES go beyond the question you're actually asking (and I'm giving it only because I've seen your previous questions ;-).
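For anyone on Python 3, here is a self-contained run of the two steps above using plain variables instead of attributes. The id 1111111111 is made up to show what happens to a record that is not yet in the list, and .items() stands in for the Python 2 .iteritems().

```python
sorted_data = [(1, 1234567890), (20, 1245678903),
               (40, 1256789034), (70, 1278903456)]
# 1111111111 is a hypothetical id that is not yet in sorted_data.
updated_records = {1245678903: 45, 1278903456: 76, 1111111111: 5}

# Replace values for known ids; pop() removes each matched id from the dict
# and falls back to the old value when the id has no update.
sorted_data = [(updated_records.pop(recid, value), recid)
               for (value, recid) in sorted_data]

# Whatever is left in updated_records is new, so append it and re-sort.
sorted_data.extend((value, recid) for recid, value in updated_records.items())
sorted_data.sort()
```

After this, sorted_data is ordered by value again and updated_records still holds only the entry that was new.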

You'd probably be best served by some form of tree here (preserving sorted order while allowing O(log n) replacements). There is no built-in balanced tree type, but you can find many third-party examples. Alternatively, you could either:

  1. Use a binary search to find the node. The bisect module will do this, but it compares based on the normal Python comparison order, whereas you seem to be sorted based on the second element of each tuple. You could reverse the tuple order, or just write your own binary search (or simply take the code from bisect_left and modify it).

  2. Use both a dict and a list. The list contains the sorted keys only. You can wrap the dict class easily enough to ensure the two are kept in sync. This gives you fast dict updates while maintaining the sort order of the keys, and should prevent your problem of losing sort performance due to constant conversion between dict and list.

Here's a quick implementation of such a thing:

import bisect

class SortedDict(dict):
    """Dictionary which is iterable in sorted order.

    O(n) sorted iteration
    O(1) lookup
    O(log n) replacement  (but O(n) insertion of new items)
    """

    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self._keys = sorted(dict.iterkeys(self))

    def __setitem__(self, key, val):
        if key not in self:
            # New key - need to add to list of keys.
            pos = bisect.bisect_left(self._keys, key)
            self._keys.insert(pos, key)
        dict.__setitem__(self, key, val)

    def __delitem__(self, key):
        if key in self:
            pos = bisect.bisect_left(self._keys, key)
            del self._keys[pos]
        dict.__delitem__(self, key)

    def __iter__(self):
        for k in self._keys: yield k
    iterkeys = __iter__

    def iteritems(self):
        for k in self._keys: yield (k, self[k])

    def itervalues(self):
        for k in self._keys: yield self[k]

    def update(self, other):
        dict.update(self, other)
        self._keys = sorted(dict.iterkeys(self)) # Rebuild (faster if lots of changes made - may be slower if only minor changes to large dict)

    def keys(self): return list(self.iterkeys())
    def values(self): return list(self.itervalues())
    def items(self): return list(self.iteritems())

    def __repr__(self):
        return "%s(%s)" % (self.__class__.__name__, ', '.join("%s=%r" % (k, self[k]) for k in self))
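The class above is Python 2 (iterkeys, iteritems). A rough Python 3 adaptation of the same idea might look like the sketch below. SortedKeyDict is a made-up name, and it deliberately covers only item assignment, deletion and iteration: methods such as update, pop, setdefault and clear would bypass the key list and need the same treatment.

```python
import bisect


class SortedKeyDict(dict):
    """dict whose iteration follows sorted key order (Python 3 sketch)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._keys = sorted(super().keys())

    def __setitem__(self, key, val):
        if key not in self:
            # New key: binary search for the slot, then an O(n) list insert.
            bisect.insort_left(self._keys, key)
        super().__setitem__(key, val)

    def __delitem__(self, key):
        super().__delitem__(key)  # raises KeyError if key is missing
        del self._keys[bisect.bisect_left(self._keys, key)]

    def __iter__(self):
        return iter(self._keys)

    def keys(self):
        return list(self._keys)

    def values(self):
        return [self[k] for k in self._keys]

    def items(self):
        return [(k, self[k]) for k in self._keys]


d = SortedKeyDict({1245678903: 20, 1234567890: 1})
d[1278903456] = 70   # new key, inserted in sorted position
d[1245678903] = 45   # existing key, key list untouched
del d[1234567890]
```

Replacing the value of an existing key never touches the key list, which is the case the question cares about most.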

Since you want to replace by dictionary key, but have the array sorted by dictionary value, you definitely need a linear search for the key. In that sense, your algorithm is the best you can hope for.

If you kept the old dictionary value, you could binary-search for that value and then locate the key near the position the binary search leads you to.
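One way to sketch this in Python 3, assuming you keep a side dict from id to current value (old_values and update are hypothetical names). Because the tuples are (value, recid) and ids are unique, bisect can search for the exact old tuple, so no separate scan of the neighbourhood is needed.

```python
import bisect

sorted_data = [(1, 1234567890), (20, 1245678903),
               (40, 1256789034), (70, 1278903456)]
# Assumption: a side dict remembers the current value of every id.
old_values = {recid: value for value, recid in sorted_data}


def update(recid, new_value):
    """Move recid's tuple to the position matching its new value."""
    old = (old_values[recid], recid)
    i = bisect.bisect_left(sorted_data, old)          # O(log n) search
    if i == len(sorted_data) or sorted_data[i] != old:
        raise KeyError(recid)
    del sorted_data[i]                                # O(n) shift
    bisect.insort(sorted_data, (new_value, recid))    # O(log n) search + O(n) shift
    old_values[recid] = new_value


for recid, value in {1245678903: 45, 1278903456: 76}.items():
    update(recid, value)
```

The list stays sorted by value after every update, so no re-sort is required. The delete and insert are still O(n) list operations, though in practice they are fast memory moves.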

