More efficient way to get unique first occurrence from a Python dict
I have a very large file I'm parsing, getting a key and a value from each line. I want to keep only the first key for each value, so that each value appears only once; that is, I'm removing the duplicate values.
So it would look like:
{
A:1
B:2
C:3
D:2
E:2
F:3
G:1
}
and it would output:
{E:2,F:3,G:1}
It's a bit confusing because I don't really care what the key is. So E in the above could be replaced with B or D, F could be replaced with C, and G could be replaced with A.
Here is the best way I have found to do it, but it gets extremely slow as the file gets larger.
mapp = {}
value_holder = []
for i in mydict:
    if mydict[i] not in value_holder:
        mapp[i] = mydict[i]
        value_holder.append(mydict[i])
Must look through value_holder every time :( Is there a faster way to do this?
Yes, a trivial change makes it much faster:
value_holder = set()
(Well, you also have to change the append to add. But still pretty simple.)
Using a set instead of a list means each lookup is O(1) instead of O(N), so the whole operation is O(N) instead of O(N^2). In other words, if you have 10,000 lines, you're doing 10,000 hash lookups instead of 50,000,000 comparisons.
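To see the difference concretely, here's a small timing sketch (not part of the original answer; the sizes and probe element are arbitrary) comparing a membership test on a list versus a set:

```python
import timeit

values = list(range(10_000))
as_list = list(values)
as_set = set(values)

# Worst case for the list: the element we probe sits at the very end,
# so every test scans all 10,000 entries. The set hashes straight to it.
list_time = timeit.timeit(lambda: 9_999 in as_list, number=1_000)
set_time = timeit.timeit(lambda: 9_999 in as_set, number=1_000)

print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

On a typical machine the set lookup comes out orders of magnitude faster.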
One caveat with this solution (and all of the others posted) is that it requires the values to be hashable. If they're not hashable, but they are comparable, you can still get O(N log N) instead of O(N^2) by using a sorted set (e.g., from the blist library). If they're neither hashable nor sortable... well, you'll probably want to find some way to generate something hashable (or sortable) to use as a "first check", and then only walk the "first check" matches for actual matches, which will get you to O(NM), where M is the average number of hash collisions.
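A rough sketch of that "first check" idea (the function name and the choice of repr() as the hashable proxy are mine, purely for illustration):

```python
# dedupe_by_value and the repr()-as-proxy choice are made up for illustration.
def dedupe_by_value(pairs):
    seen = {}   # hashable proxy -> values already kept under that proxy
    out = {}
    for k, v in pairs:
        proxy = repr(v)                 # hashable stand-in for the value
        candidates = seen.setdefault(proxy, [])
        # Only walk the (hopefully short) list of proxy collisions:
        if not any(c == v for c in candidates):
            candidates.append(v)
            out[k] = v
    return out

pairs = [('A', [1]), ('B', [2]), ('C', [1])]  # lists are unhashable
print(dedupe_by_value(pairs))  # {'A': [1], 'B': [2]}
```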
You might want to look at how unique_everseen is implemented in the itertools recipes in the standard library documentation.
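For reference, the recipe looks roughly like this in Python 3 (the 2.x version uses itertools.ifilterfalse instead of filterfalse):

```python
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "Yield unique elements, preserving order. Remembers all elements ever seen."
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element

# Keep the first (key, value) pair for each distinct value:
pairs = [('A', 1), ('B', 2), ('D', 2), ('C', 3)]
print(dict(unique_everseen(pairs, key=lambda kv: kv[1])))
# {'A': 1, 'B': 2, 'C': 3}
```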
Note that dictionaries don't actually have an order, so there's no way to pick the "first" duplicate; you'll just get one arbitrarily. In which case, there's another way to do this:
inverted = {v:k for k, v in d.iteritems()}
reverted = {v:k for k, v in inverted.iteritems()}
(This is effectively a form of the decorate-process-undecorate idiom without any processing.)
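Run against the question's sample data (Python 3 here, so .items() instead of .iteritems()), the double inversion collapses the duplicates:

```python
# The question's sample data:
d = {'A': 1, 'B': 2, 'C': 3, 'D': 2, 'E': 2, 'F': 3, 'G': 1}

# First inversion: for each value, the last key wins, leaving one key per value.
inverted = {v: k for k, v in d.items()}
# Second inversion: flip back to key -> value, now with duplicates gone.
reverted = {v: k for k, v in inverted.items()}

print(reverted)  # {'G': 1, 'E': 2, 'F': 3} under CPython's insertion order
```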
But instead of building up the dict and then filtering it, you can make things better (simpler, faster, more memory-efficient, and order-preserving) by filtering as you read. Basically, keep the set alongside the dict as you go along. For example, instead of this:
mydict = {}
for line in f:
    k, v = line.split(None, 1)
    mydict[k] = v
mapp = {}
value_holder = set()
for i in mydict:
    if mydict[i] not in value_holder:
        mapp[i] = mydict[i]
        value_holder.add(mydict[i])
Just do this:
mapp = {}
value_holder = set()
for line in f:
    k, v = line.split(None, 1)
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)
In fact, you may want to consider writing a one_to_one_dict that wraps this up (or search PyPI modules and ActiveState recipes to see if someone has already written it for you), so then you can just write:
mapp = one_to_one_dict()
for line in f:
    k, v = line.split(None, 1)
    mapp[k] = v
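A minimal sketch of what such a one_to_one_dict could look like (this exact class is hypothetical, not an existing package):

```python
# A hypothetical one_to_one_dict: a dict that ignores an assignment
# whose value has already been stored under some earlier key.
class one_to_one_dict(dict):
    def __init__(self):
        super().__init__()
        self._seen_values = set()

    def __setitem__(self, key, value):
        if value not in self._seen_values:
            self._seen_values.add(value)
            super().__setitem__(key, value)

mapp = one_to_one_dict()
for k, v in [('A', 1), ('B', 2), ('D', 2), ('C', 3)]:
    mapp[k] = v

print(mapp)  # {'A': 1, 'B': 2, 'C': 3}
```

A fuller implementation would also override update() and setdefault(), which bypass __setitem__ in a plain dict subclass.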
I'm not completely clear on exactly what you're doing, but set is a great way to remove duplicates. For example:
>>> k = [1,3,4,4,5,4,3,2,2,3,3,4,5]
>>> set(k)
set([1, 2, 3, 4, 5])
>>> list(set(k))
[1, 2, 3, 4, 5]
Though it depends a bit on the structure of the input that you're loading, there might be a way to simply use set so that you don't have to iterate through the entire object every time to see if there are any matching keys; instead, run it through set once.
The first way to speed this up, as others have mentioned, is using a set to record seen values, as checking for membership on a set is much faster.
We can also make this a lot shorter with a dict comprehension:
seen = set()
new_mapp = {k: v for k, v in mapp.items() if not (v in seen or seen.add(v))}
The if condition requires a little explanation: we only keep key/value pairs where we haven't seen the value before, but we use or a little bit hackishly to ensure any unseen values are added to the set. As set.add() returns None, it will not affect the outcome.
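A self-contained run of this idiom on the question's data (written here as not (v in seen or seen.add(v)), so that the or only short-circuits for values already seen and seen.add records each new value):

```python
mapp = {'A': 1, 'B': 2, 'C': 3, 'D': 2, 'E': 2, 'F': 3, 'G': 1}

seen = set()
# seen.add(v) runs only when v is unseen (the or short-circuits otherwise);
# it returns None, so "not (... or None)" is True and the pair is kept.
new_mapp = {k: v for k, v in mapp.items() if not (v in seen or seen.add(v))}

print(new_mapp)  # {'A': 1, 'B': 2, 'C': 3}
```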
As always, in 2.x, use dict.iteritems() over dict.items().
Using a set instead of a list will speed you up considerably...
You said you are reading from a very large file and want to keep only the first occurrence of each value. I originally assumed this meant you care about the order in which the key/value pairs occur in the very large file. This code will do that and will be fast.
values_seen = set()
mapp = {}
with open("large_file.txt") as f:
    for line in f:
        key, value = line.split()
        if value not in values_seen:
            values_seen.add(value)
            mapp[key] = value
You were using a list to keep track of what values your code had seen. Searching through a list is very slow: it gets slower the larger the list gets. A set is much faster because lookups take close to constant time (they don't get much slower, or maybe not slower at all, the larger the set gets). (A dict also works the way a set works.)
Part of your problem is that dicts do not preserve any sort of logical ordering when they are iterated through. They use hash tables to index items (see this great article). So there's no real concept of "first occurrence of a value" in this sort of data structure. The right way to do this would probably be a list of key-value pairs, e.g.:
kv_pairs = [(k1,v1),(k2,v2),...]
or, because the file is so large, it would be better to use the excellent file iteration Python provides to retrieve the k/v pairs:
def kv_iter(f):
    # f being the file descriptor
    for line in f:
        yield ...  # (whatever logic you use to get k, v values from a line)
value_holder is a great candidate for a set variable. You are really just testing membership in value_holder. Because the kept values are unique, they can be indexed more efficiently using a similar hashing method. So it would end up a bit like this:
mapp = {}
value_holder = set()
for k, v in kv_iter(f):
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)
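Putting the pieces together as a runnable sketch (io.StringIO stands in for the large file, and the whitespace-separated line format is an assumption about the input):

```python
import io

def kv_iter(f):
    # f being the file descriptor (or any iterable of lines)
    for line in f:
        k, v = line.split(None, 1)
        yield k, v.strip()   # strip the trailing newline from the value

# io.StringIO stands in for the real (very large) file:
f = io.StringIO("A 1\nB 2\nC 3\nD 2\nE 2\nF 3\nG 1\n")

mapp = {}
value_holder = set()
for k, v in kv_iter(f):
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)

print(mapp)  # {'A': '1', 'B': '2', 'C': '3'}
```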