Python - Group duplicates in a list of lists by index

I've seen a lot of questions about removing duplicates from a list and counting them. But I'm trying to find the best way to group them, for a list of lists.

Given this example, I want to group by the third field:

[[1, "text", "name1", "text"],
 [2, "text", "name2", "text"],
 [3, "text", "name2", "text"],
 [4, "text", "name1", "text"]]

I'd like to get this:

[[[1, "text", "name1", "text"],
  [4, "text", "name1", "text"]],
 [[2, "text", "name2", "text"],
  [3, "text", "name2", "text"]]]

I can think of the naive way: loop through the list and keep track of what has already been found, which is O(n^2). But I would assume there's a better way.

You could use sorted and groupby, but that is O(n log n):

from operator import itemgetter
from itertools import groupby

# l is the list of lists from the question
print([list(v) for _, v in groupby(sorted(l, key=itemgetter(2)), itemgetter(2))])
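As a self-contained illustration of the approach above, here is a minimal Python 3 sketch on the question's sample data. The name `rows` is an assumption standing in for `l`. It also shows why the sort matters: groupby only merges adjacent items with equal keys, so skipping the sort splits a group whose rows are not next to each other.

```python
from itertools import groupby
from operator import itemgetter

# Sample data from the question; `rows` is an illustrative name.
rows = [[1, "text", "name1", "text"],
        [2, "text", "name2", "text"],
        [3, "text", "name2", "text"],
        [4, "text", "name1", "text"]]

key = itemgetter(2)  # group on the third field

# sorted() is stable, so rows keep their original relative order within a group.
grouped = [list(g) for _, g in groupby(sorted(rows, key=key), key)]

# Without sorting, groupby only merges *adjacent* rows with equal keys,
# so "name1" ends up in two separate groups here.
unsorted_grouped = [list(g) for _, g in groupby(rows, key)]

print(grouped)
print(len(unsorted_grouped))  # 3 groups instead of 2
```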

Or use an OrderedDict for an O(n) solution, using the third element as the key and appending the sublists as values. setdefault handles the repeated keys:

from collections import OrderedDict
from pprint import pprint as pp

od = OrderedDict()

for sub in l:
    od.setdefault(sub[2], []).append(sub)

pp(list(od.values()))  # list() so the values also print as a list on Python 3

[[[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']],
 [[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']]]

If order does not matter, you can use a defaultdict in place of the OrderedDict; it is by far the most efficient option:
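A minimal sketch of the defaultdict variant on the question's data, assuming Python 3 (the names `rows`, `groups` and `result` are illustrative). On Python 3.7 and later, dicts (and therefore defaultdict) preserve insertion order, so the groups also come out in first-seen key order there.

```python
from collections import defaultdict

# Sample data from the question; `rows` is an illustrative name.
rows = [[1, "text", "name1", "text"],
        [2, "text", "name2", "text"],
        [3, "text", "name2", "text"],
        [4, "text", "name1", "text"]]

groups = defaultdict(list)
for row in rows:
    # A missing key is created with an empty list, so no setdefault is needed.
    groups[row[2]].append(row)

result = list(groups.values())
print(result)
```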

In [6]: from random import choice

In [7]: from itertools import groupby

In [8]: from collections import OrderedDict, defaultdict                               

In [9]: l = [[1, "text", "name{}".format(choice(list(range(2000)))), "text"] for _ in xrange(40000)]

In [13]: from operator import  itemgetter

In [14]: timeit [list(v) for _,v in groupby( sorted(l,key=itemgetter(2)),itemgetter(2))]
10 loops, best of 3: 42.5 ms per loop

In [15]: %%timeit                                                                       
od = defaultdict(list)
for sub in l:
    od[sub[2]].append(sub)
   ....: 
100 loops, best of 3: 9.42 ms per loop

In [16]: %%timeit                                                                       
od = OrderedDict()
for sub in l:
     od.setdefault(sub[2],[]).append(sub)
   ....: 
10 loops, best of 3: 25.5 ms per loop

In [17]: lists = l

In [18]: %%timeit
   ....: groupers = set(l[2] for l in lists)
   ....: [filter(lambda x: x[2] == y, lists) for y in groupers]
   ....: 

1 loops, best of 3: 8.48 s per loop

In [19]: timeit l = [filter(lambda x: x[2] == y, lists) for y in   set(l[2] for l in lists)]
1 loops, best of 3: 8.29 s per loop

So if order does not matter, defaultdict wins. groupby still performs pretty well, because sorting is cheap compared with a quadratic approach. As you can see, the filter approach scans the whole list once per distinct key, and that quadratic behaviour performs badly as the data grows.
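To rerun a comparison like this outside IPython, a rough Python 3 sketch using the standard timeit module could look like the following. The data size, the seed, and the function names are assumptions chosen to keep the run short; it does not reproduce the timings above, and absolute numbers will differ by machine.

```python
import random
import timeit
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

random.seed(0)  # reproducible data; sizes are smaller than in the session above
rows = [[i, "text", "name{}".format(random.randrange(200)), "text"]
        for i in range(4000)]
key = itemgetter(2)

def by_groupby(rows):
    # O(n log n): sort, then merge adjacent rows with equal keys
    return [list(g) for _, g in groupby(sorted(rows, key=key), key)]

def by_defaultdict(rows):
    # O(n): a single pass with one dict lookup per row
    groups = defaultdict(list)
    for row in rows:
        groups[row[2]].append(row)
    return list(groups.values())

def by_filter(rows):
    # O(n * k) for k distinct keys: one full scan of rows per key
    return [[row for row in rows if row[2] == name]
            for name in {row[2] for row in rows}]

for func in (by_groupby, by_defaultdict, by_filter):
    seconds = timeit.timeit(lambda: func(rows), number=5)
    print("{:<15} {:.4f} s for 5 runs".format(func.__name__, seconds))
```

All three functions return the same groups; only the order of the groups differs between them.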

Here you go (Python 2, where filter returns a list):

>>> import pprint
>>> lists = [[1, "text", "name1", "text"],
...  [2, "text", "name2", "text"],
...  [3, "text", "name2", "text"],
...  [4, "text", "name1", "text"]]
>>> groupers = set(l[2] for l in lists)
>>> groupers
set(['name2', 'name1'])
>>> l = [filter(lambda x: x[2] == y, lists) for y in groupers]
>>> pprint.pprint(l)
[[[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']],
 [[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']]]

You can of course write the whole grouping logic in a single line:

>>> l = [filter(lambda x: x[2] == y, lists) for y in set(l[2] for l in lists)]
>>> pprint.pprint(l)
[[[2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']],
 [[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text']]]
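On Python 3, filter returns a lazy iterator rather than a list, so the snippets above would print filter objects instead of the groups. A minimal sketch of an equivalent, assuming Python 3.7+, uses a list comprehension for the filtering and dict.fromkeys instead of set so the keys keep their first-seen order (`grouped` and `name` are illustrative names). It still scans the whole list once per distinct key.

```python
lists = [[1, "text", "name1", "text"],
         [2, "text", "name2", "text"],
         [3, "text", "name2", "text"],
         [4, "text", "name1", "text"]]

# dict.fromkeys de-duplicates while keeping first-seen order (unlike a set).
groupers = list(dict.fromkeys(row[2] for row in lists))

# One pass over `lists` per key, collecting the rows that match it.
grouped = [[row for row in lists if row[2] == name] for name in groupers]

print(groupers)
print(grouped)
```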

The easiest way to bring matching items together is with the key argument of the sorted() function. Note that this only sorts the list so that equal keys end up adjacent; it does not nest them into separate groups. In your example:

>>> a = [[1, "text", "name1", "text"], [2, "text", "name2", "text"], [3, "text", "name2", "text"], [4, "text", "name1", "text"]]

>>> sorted(a, key=lambda item: item[2])
[[1, 'text', 'name1', 'text'], [4, 'text', 'name1', 'text'], [2, 'text', 'name2', 'text'], [3, 'text', 'name2', 'text']]

You can find more information about this argument on this link.

Use sorted with the element you want to sort on as the key, then itertools.groupby to group them:

>>> from itertools import groupby
>>> sl = sorted(your_list, key=lambda sub: sub[2])
>>> [list(v) for k, v in groupby(sl, key=lambda sub: sub[2])]
[[[1, 'text', 'name1', 'text'],
  [4, 'text', 'name1', 'text']],
 [[2, 'text', 'name2', 'text'],
  [3, 'text', 'name2', 'text']]]

The following function quickly (no sorting required) groups sub-sequences of any length by the element at a specified index:

# Given a sequence of sequences like [(3,'c',6),(7,'a',2),(88,'c',4),(45,'a',0)],
# return a dict grouping the sequences by their idx-th element. With idx=1:
#   merge=True  -> {'c': (3, 6, 88, 4),     'a': (7, 2, 45, 0)}
#   merge=False -> {'c': ((3, 6), (88, 4)), 'a': ((7, 2), (45, 0))}
def group_by_idx(seqs, idx=0, merge=True):
    d = dict()
    for seq in seqs:
        # Build each group with the same type as the input item (tuple or list).
        seq_kind = tuple if isinstance(seq, tuple) else list
        k = seq[idx]
        rest = seq[:idx] + seq[idx+1:]  # the item without its key element
        # merge=True concatenates the remainders; merge=False keeps each one separate.
        d[k] = d.get(k, seq_kind()) + (rest if merge else seq_kind((rest,)))
    return d

In your case, the key is the element at index 2, so

group_by_idx(your_list,2,False)

gives

{'name1': [[1, 'text', 'text'], [4, 'text', 'text']],
 'name2': [[2, 'text', 'text'], [3, 'text', 'text']]}

which is not exactly the output you asked for, but may suit your needs just as well.
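If the exact output from the question is needed, a variant that keeps each sub-list whole might look like the sketch below. `group_rows_by_idx` is a hypothetical name, not part of the function above, and the first-seen ordering of the groups assumes Python 3.7+, where plain dicts preserve insertion order.

```python
def group_rows_by_idx(rows, idx=0):
    """Group whole rows by their idx-th element, in first-seen key order."""
    groups = {}
    for row in rows:
        # setdefault creates the list the first time a key is seen.
        groups.setdefault(row[idx], []).append(row)
    return list(groups.values())

your_list = [[1, "text", "name1", "text"],
             [2, "text", "name2", "text"],
             [3, "text", "name2", "text"],
             [4, "text", "name1", "text"]]

print(group_rows_by_idx(your_list, 2))
```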
