Python - Group duplicates in a list of lists by index
I've seen a lot of questions about removing duplicates from a list and counting them. But I'm trying to find the best way to group them, for a list of lists.
Given this example, I want to group by the third field:
[[1, "text", "name1", "text"],
[2, "text", "name2", "text"],
[3, "text", "name2", "text"],
[4, "text", "name1", "text"]]
I'd like to get this:
[[[1, "text", "name1", "text"],
[4, "text", "name1", "text"]],
[[2, "text", "name2", "text"],
[3, "text", "name2", "text"]]]
I can think of the naive way: loop through and keep track of what has already been found (O(n^2)). But I would assume there's a better way.
You could sort and use groupby, but that is O(n log n):
from operator import itemgetter
from itertools import groupby
print([list(v) for _,v in groupby( sorted(l,key=itemgetter(2)),itemgetter(2))])
Or use an OrderedDict keyed on the third element for an O(n) solution, appending the sublists as values. setdefault will handle the repeated keys:
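from collections import OrderedDict
from pprint import pprint as pp

od = OrderedDict()
for sub in l:
    # setdefault creates the empty list the first time a key is seen
    od.setdefault(sub[2], []).append(sub)

pp(list(od.values()))  # list() so that Python 3 prints the lists rather than a view object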
[[[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']],
[[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']]]
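As a hedged aside: on Python 3.7 and later, a plain dict also preserves insertion order, so the same idea can be sketched without OrderedDict. The helper name group_rows and the variable names below are illustrative only and do not come from the answer above.

```python
# Sketch: group rows by the value at a given index, keeping first-seen key order.
# Assumes Python 3.7+, where plain dicts preserve insertion order.

def group_rows(rows, idx):  # hypothetical helper name
    groups = {}
    for row in rows:
        # setdefault creates the empty list the first time a key is seen
        groups.setdefault(row[idx], []).append(row)
    return list(groups.values())

rows = [[1, "text", "name1", "text"],
        [2, "text", "name2", "text"],
        [3, "text", "name2", "text"],
        [4, "text", "name1", "text"]]

result = group_rows(rows, 2)
print(result)
```

Each row is visited once and each dict operation is O(1) on average, so this stays O(n) like the OrderedDict version.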
If order does not matter, you can use a defaultdict in place of the OrderedDict; it is by far the most efficient option:
In [7]: from itertools import groupby
In [8]: from collections import OrderedDict, defaultdict
In [9]: l = [[1, "text", "name{}".format(choice(list(range(2000)))), "text"] for _ in xrange(40000)]
In [13]: from operator import itemgetter
In [14]: timeit [list(v) for _,v in groupby( sorted(l,key=itemgetter(2)),itemgetter(2))]
10 loops, best of 3: 42.5 ms per loop
In [15]: %%timeit
....: od = defaultdict(list)
....: for sub in l:
....:     od[sub[2]].append(sub)
....:
100 loops, best of 3: 9.42 ms per loop
In [16]: %%timeit
....: od = OrderedDict()
....: for sub in l:
....:     od.setdefault(sub[2],[]).append(sub)
....:
10 loops, best of 3: 25.5 ms per loop
In [17]: lists = l
In [18]: %%timeit
....: groupers = set(l[2] for l in lists)
....: [filter(lambda x: x[2] == y, lists) for y in groupers]
....:
1 loops, best of 3: 8.48 s per loop
In [19]: timeit l = [filter(lambda x: x[2] == y, lists) for y in set(l[2] for l in lists)]
1 loops, best of 3: 8.29 s per loop
So if order does not matter, defaultdict wins. groupby still performs pretty well, because sorting is cheap compared with a quadratic approach. As you can see, the filter approach rescans the whole list once per distinct key (quadratic in the worst case), so it performs badly as the data grows.
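The session above uses Python 2 (xrange, and a filter that returns a list). As a rough sanity check, here is a Python 3 sketch showing that the three approaches produce the same groups. The data size (1000 rows, 20 distinct names), the fixed seed, and the helper name normalized are assumptions chosen to keep the example small; they are not the values that were timed.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
from random import choice, seed

seed(0)  # fixed seed so the sketch is repeatable
rows = [[i, "text", "name{}".format(choice(range(20))), "text"] for i in range(1000)]
key = itemgetter(2)

# 1. sort + groupby: O(n log n)
by_groupby = [list(g) for _, g in groupby(sorted(rows, key=key), key)]

# 2. defaultdict: O(n)
dd = defaultdict(list)
for row in rows:
    dd[key(row)].append(row)
by_defaultdict = list(dd.values())

# 3. one full scan per distinct key: O(n * k) for k distinct keys
by_filter = [[r for r in rows if r[2] == k] for k in set(map(key, rows))]

def normalized(groups):
    # order the groups by their key so the three results can be compared
    return sorted(groups, key=lambda g: g[0][2])

print(normalized(by_groupby) == normalized(by_defaultdict) == normalized(by_filter))  # prints True
```

The groups differ only in the order in which they come out; inside each group the rows keep their original relative order in all three cases, because sorted is stable.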
Here you go:
>>> import pprint
>>> lists = [[1, "text", "name1", "text"],
...          [2, "text", "name2", "text"],
...          [3, "text", "name2", "text"],
...          [4, "text", "name1", "text"]]
>>> groupers = set(l[2] for l in lists)
>>> groupers
set(['name2', 'name1'])
>>> # list() keeps this working on Python 3, where filter returns an iterator
>>> l = [list(filter(lambda x: x[2] == y, lists)) for y in groupers]
>>> pprint.pprint(l)
[[[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']],
[[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']]]
You can of course write the whole grouping logic in a single line:
>>> l = [list(filter(lambda x: x[2] == y, lists)) for y in set(l[2] for l in lists)]
>>> pprint.pprint(l)
[[[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']],
[[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']]]
The easiest way of doing that is with the key argument of the sorted() function, which places rows with the same key next to each other (the result is still a flat list). In your example:
>>> a = [[1, "text", "name1", "text"], [2, "text", "name2", "text"], [3, "text", "name2", "text"], [4, "text", "name1", "text"]]
>>> sorted(a, key=lambda item: item[2])
[[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text'], [2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']]
You can find more information about this argument on this link.
Use sorted with the element you want to sort on as the key, and itertools groupby to group 'em:
>>> from itertools import groupby
>>> sl = sorted(your_list, key=lambda row: row[2])
>>> [list(v) for k, v in groupby(sl, key=lambda row: row[2])]
[[[1, 'text', 'name1', 'text'],
[4, 'text', 'name1', 'text']],
[[2, 'text', 'name2', 'text'],
[3, 'text', 'name2', 'text']]]
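The sort matters because groupby only merges consecutive items that share a key. A small sketch using the question's data (the variable names are illustrative) shows the difference:

```python
from itertools import groupby
from operator import itemgetter

rows = [[1, "text", "name1", "text"],
        [2, "text", "name2", "text"],
        [3, "text", "name2", "text"],
        [4, "text", "name1", "text"]]
key = itemgetter(2)

# Without sorting, name1 appears in two separate runs, so it yields two groups.
unsorted_groups = [list(g) for _, g in groupby(rows, key)]

# After sorting, each key forms a single run and therefore a single group.
sorted_groups = [list(g) for _, g in groupby(sorted(rows, key=key), key)]

print(len(unsorted_groups))  # prints 3
print(len(sorted_groups))    # prints 2
```

Because sorted is stable, rows inside each group keep their original relative order.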
The following function will quickly (no sorting required) group sub-sequences of any length by the key at a specified index:
# Given a sequence of sequences like [(3,'c',6),(7,'a',2),(88,'c',4),(45,'a',0)],
# returns a dict grouping the sequences by their idx-th element. With idx=1 we have:
#   if merge is True  {'c':(3,6,88,4), 'a':(7,2,45,0)}
#   if merge is False {'c':((3,6),(88,4)), 'a':((7,2),(45,0))}
def group_by_idx(seqs, idx=0, merge=True):
    d = dict()
    for seq in seqs:
        # build the grouped values with the same type as the input rows
        if isinstance(seq, tuple): seq_kind = tuple
        if isinstance(seq, list): seq_kind = list
        k = seq[idx]
        rest = seq[:idx] + seq[idx+1:]  # the row without its key element
        v = d.get(k, seq_kind()) + (rest if merge else seq_kind((rest,)))
        d[k] = v
    return d
In the case of your question, the key is the element at index 2, therefore
group_by_idx(your_list,2,False)
gives
{'name1': [[1, 'text', 'text'], [4, 'text', 'text']],
'name2': [[2, 'text', 'text'], [3, 'text', 'text']]}
which is not exactly the output you asked for, but might suit your needs just as well.