Python: remove lots of items from a list

I am in the final stretch of a project I have been working on. Everything is running smoothly, but I have a bottleneck that I am having trouble working around.

I have a list of tuples. The list ranges in length from, say, 40,000 to 1,000,000 records. I also have a dictionary where each and every (value, key) pair is a tuple in the list.

So, I might have:

myList = [(20000, 11), (16000, 4), (14000, 9)...]
myDict = {11:20000, 9:14000, ...}

I want to remove each (v, k) tuple from the list.

Currently I am doing:

for k, v in myDict.iteritems():
    myList.remove((v, k))

Removing 838 tuples from a list containing 20,000 tuples takes anywhere from 3 to 4 seconds. I will most likely be removing more like 10,000 tuples from a list of 1,000,000, so I need this to be faster.

Is there a better way to do this?

I can provide code used to test, plus pickled data from the actual application if needed.

You'll have to measure, but I can imagine this being more performant:

myList = filter(lambda x: myDict.get(x[1], None) != x[0], myList)

because the lookup happens in the dict, which is better suited to this kind of thing. Note, though, that this will create a new list before the old one is discarded, so there's a memory tradeoff. If that's an issue, rethinking your container type as jkp suggests might be in order.

Edit: Be careful, though, if None is actually in your list; you'd have to use a different "placeholder".

To remove about 10,000 tuples from a list of about 1,000,000, if the values are hashable, the fastest approach should be:

totoss = set((v,k) for (k,v) in myDict.iteritems())
myList[:] = [x for x in myList if x not in totoss]

The preparation of the set is a small one-time cost, which saves doing tuple unpacking and repacking, or tuple indexing, a lot of times. Assigning to myList[:] instead of assigning to myList is also semantically important (in case there are any other references to myList around, it's not enough to rebind just the name; you really want to rebind the contents!).
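
A small sketch of why the slice assignment matters. It is written for Python 3, so items() stands in for iteritems(); the sample data and the names alias and rebound are made up for illustration.

```python
# Hypothetical sample data in the (value, key) layout from the question.
my_list = [(20000, 11), (16000, 4), (14000, 9)]
my_dict = {11: 20000, 9: 14000}
alias = my_list  # a second reference to the same list object

# Build the set of (value, key) tuples to discard once, up front.
to_toss = {(v, k) for k, v in my_dict.items()}

# Slice assignment replaces the contents of the existing list object,
# so every reference to it sees the filtered result.
my_list[:] = [x for x in my_list if x not in to_toss]
print(alias)  # [(16000, 4)]

# Plain assignment only rebinds the name; the object it used to
# point at is left untouched.
rebound = alias
rebound = [x for x in rebound if x[1] != 4]
print(alias)    # still [(16000, 4)]
print(rebound)  # []
```

If nothing else holds a reference to the list, the two forms behave the same, and plain assignment is fine.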

I don't have your test data around to do the time measurement myself, alas, but let me know how it plays out on your test data!

If the values are not hashable (e.g. they're sub-lists), fastest is probably:

sentinel = object()
myList[:] = [x for x in myList if myDict.get(x[1], sentinel) != x[0]]

or maybe (it shouldn't make a big difference either way, but I suspect the previous one is better, since indexing is cheaper than unpacking and repacking):

sentinel = object()
myList[:] = [(a,b) for (a,b) in myList if myDict.get(b, sentinel) != a]

In these two variants the sentinel idiom guards against values of None (which is not a problem for the preferred set-based approach, if values are hashable!). It is also going to be way cheaper than if b not in myDict or myDict[b] != a (which requires two lookups into myDict).
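
To make the None pitfall concrete, here is a minimal sketch in Python 3. It assumes the question's (value, key) layout, so the key is x[1] and the value is x[0]; the data is invented for illustration.

```python
# Hypothetical data: the tuple (None, 7) has a key that is NOT in the dict.
my_list = [(20000, 11), (None, 7), (16000, 4)]
my_dict = {11: 20000}

# Defaulting to None: the missing key 7 yields None, which equals the
# tuple's value, so (None, 7) is dropped by mistake.
with_none = [x for x in my_list if my_dict.get(x[1], None) != x[0]]

# Defaulting to a fresh object(): nothing in the data can equal it,
# so a missing key always compares unequal and the tuple is kept.
sentinel = object()
with_sentinel = [x for x in my_list if my_dict.get(x[1], sentinel) != x[0]]

print(with_none)      # [(16000, 4)]
print(with_sentinel)  # [(None, 7), (16000, 4)]
```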

Every time you call myList.remove, Python has to scan over the list to search for that item and remove it. In the worst case scenario, every item you look for would be at the end of the list each time.
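
The linear scan can be made visible by counting equality comparisons. This is only an illustrative sketch for Python 3 (CPython); the Counted class is a made-up helper, not anything from the question.

```python
class Counted:
    """Wraps a value and counts how many equality comparisons are made."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        Counted.comparisons += 1
        return self.value == other.value


items = [Counted(i) for i in range(1000)]

Counted.comparisons = 0
items.remove(Counted(0))    # target is the first element
first = Counted.comparisons

Counted.comparisons = 0
items.remove(Counted(999))  # target is now the last of 999 elements
last = Counted.comparisons

print(first, last)  # 1 999
```

Removing d items this way costs on the order of n comparisons per call, which is where the seconds go.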

Have you tried doing the "inverse" operation of:

newMyList = [(v,k) for (v,k) in myList if k not in myDict]

But I'm really not sure how well that would scale, either, since you would be making a copy of the original list; that could potentially be a lot of memory usage.

Probably the best alternative here is to wait for Alex Martelli to post some mind-blowingly intuitive, simple, and efficient approach.

The problem looks to me to be the fact that you are using a list as the container you are trying to remove from, and a list keeps no index of its items by value. So finding each item in the list is a linear operation, O(n): it has to iterate over the list until it finds a match.

If you could swap the list for some other container (a set?) which uses the hash() of each item to locate it, then each match could be performed much quicker.

The following code shows how you could do this using a combination of ideas offered by myself and Nick on this thread:

list_set = set(original_list)
dict_set = set(zip(original_dict.values(), original_dict.keys()))
# Keep the difference as a set so each membership test below is O(1).
difference_set = list_set - dict_set
final_list = []
for item in original_list:
    if item in difference_set:
        final_list.append(item)

[(i, j) for i, j in myList if myDict.get(j) != i]

Try something like this:

myListSet = set(myList)
myDictSet = set(zip(myDict.values(), myDict.keys()))
myList = list(myListSet - myDictSet)

This will convert myList to a set, swap the keys and values in myDict and put them into a set, then find the difference, turn it back into a list, and assign it back to myList. :)
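
One thing worth checking before relying on the set difference: a set keeps neither the original order nor duplicate tuples. A minimal Python 3 sketch with invented data, comparing it against filtering the list in order:

```python
# Hypothetical data containing a duplicate tuple.
my_list = [(14000, 9), (16000, 4), (500, 2), (16000, 4)]
my_dict = {9: 14000}

to_remove = set(zip(my_dict.values(), my_dict.keys()))

# Set difference: duplicates collapse and the resulting order is arbitrary.
via_sets = list(set(my_list) - to_remove)

# Filtering against the set: original order and duplicates are preserved.
via_filter = [x for x in my_list if x not in to_remove]

print(sorted(via_sets))  # [(500, 2), (16000, 4)]
print(via_filter)        # [(16000, 4), (500, 2), (16000, 4)]
```

If the list is known to hold unique tuples and its order does not matter, the two give the same contents.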

toRemove = set(zip(myDict.values(), myDict.keys()))  # build once, not once per list element
[i for i in myList if i not in toRemove]

A list containing a million 2-tuples is not large on most machines running Python. However, if you absolutely must do the removal in situ, here is a clean way of doing it properly:

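def filter_by_dict(my_list, my_dict):
    sentinel = object()
    # Walk backwards so that deleting an item never shifts an index still to be visited.
    for i in xrange(len(my_list) - 1, -1, -1):
        key = my_list[i][1]
        # Deletes every tuple whose key is in my_dict, whatever its value.
        if my_dict.get(key, sentinel) is not sentinel:
            del my_list[i]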

Update: Actually, each del costs O(n), shuffling the list pointers down using C's memmove(), so if there are d dels, it's O(n*d), not O(n**2). Note that (1) the OP suggests that d is approximately 0.01 * n and (2) the O(n*d) effort is merely copying pointers to somewhere else in memory, so this method could in fact be somewhat faster than a quick glance would indicate. Benchmarks, anyone?
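
For anyone who wants to measure, here is a rough benchmark harness in Python 3. Everything in it is an assumption for illustration: the synthetic data from make_data, the sizes (taken from the question's 838-out-of-20,000 case), and the function names. The in-situ variant here also compares the value, not just the key. No timings are claimed; run it on your own data.

```python
import random
import timeit


def make_data(n, d, seed=0):
    """Build n unique (value, key) tuples and a dict holding d of them."""
    rng = random.Random(seed)
    data = [(rng.randrange(10**6), key) for key in range(n)]
    chosen = rng.sample(data, d)
    return data, {k: v for v, k in chosen}


def remove_loop(lst, dct):
    # The question's approach: one linear scan per removal.
    for k, v in dct.items():
        lst.remove((v, k))
    return lst


def set_filter(lst, dct):
    # Build the set once, then filter in a single pass.
    toss = {(v, k) for k, v in dct.items()}
    lst[:] = [x for x in lst if x not in toss]
    return lst


def del_in_situ(lst, dct):
    # Delete backwards so earlier indices stay valid.
    sentinel = object()
    for i in range(len(lst) - 1, -1, -1):
        value, key = lst[i]
        if dct.get(key, sentinel) == value:
            del lst[i]
    return lst


base, removals = make_data(20000, 838)
results = {}
for func in (remove_loop, set_filter, del_in_situ):
    work = list(base)  # fresh copy so each approach starts from the same input
    elapsed = timeit.timeit(lambda: func(work, removals), number=1)
    results[func.__name__] = work
    print(f"{func.__name__:12s} {elapsed:.4f} s")
```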

What are you going to do with the list after you have removed the items that are in the dict? Is it possible to piggy-back the dict-filtering onto the next step?
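
One way such piggy-backing could look, sketched in Python 3 with a generator so that no filtered copy of the list is ever built. The function name without_dict_items and the "next step" (summing the surviving values) are hypothetical stand-ins for whatever the application really does.

```python
def without_dict_items(pairs, mapping):
    """Yield each (value, key) tuple that is not recorded in mapping."""
    sentinel = object()
    for value, key in pairs:
        if mapping.get(key, sentinel) != value:
            yield (value, key)


my_list = [(20000, 11), (16000, 4), (14000, 9)]
my_dict = {11: 20000, 9: 14000}

# Hypothetical next step: consume the survivors directly while iterating.
total = sum(value for value, _ in without_dict_items(my_list, my_dict))
print(total)  # 16000
```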
