
Removing duplicates from a list of lists

I have a list of lists in Python:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

And I want to remove duplicate elements from it. If it were a normal list rather than a list of lists, I could use set. But unfortunately a list is not hashable, so I can't make a set of lists, only of tuples. So I could turn all the lists into tuples, then use set, and convert back to lists. But this isn't fast.
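
That is, something like:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
# lists -> tuples (hashable) -> set (drops duplicates) -> back to lists
k = [list(t) for t in set(tuple(x) for x in k)]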

How can this be done in the most efficient way?

The result for the above list should be:

k = [[5, 6, 2], [1, 2], [3], [4]]

I don't care about preserving order.

Note: this question is similar but not quite what I need. I searched SO but didn't find an exact duplicate.


Benchmarking:

import itertools, time


class Timer(object):
    def __init__(self, name=None):
        self.name = name

    def __enter__(self):
        self.tstart = time.time()

    def __exit__(self, type, value, traceback):
        if self.name:
            print '[%s]' % self.name,
        print 'Elapsed: %s' % (time.time() - self.tstart)


k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [5, 2], [6], [8], [9]] * 5
N = 100000

print len(k)

with Timer('set'):
    for i in xrange(N):
        kt = [tuple(i) for i in k]
        skt = set(kt)
        kk = [list(i) for i in skt]


with Timer('sort'):
    for i in xrange(N):
        ks = sorted(k)
        dedup = [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]


with Timer('groupby'):
    for i in xrange(N):
        k = sorted(k)
        dedup = list(k for k, _ in itertools.groupby(k))

with Timer('loop in'):
    for i in xrange(N):
        new_k = []
        for elem in k:
            if elem not in new_k:
                new_k.append(elem)

"loop in" (quadratic method) fastest of all for short lists. “循环”(二次方法)对于短列表来说是最快的。 For long lists it's faster then everyone except groupby method.对于长列表,除了 groupby 方法之外,它比所有人都快。 Does this make sense?这有意义吗?

For the short list (the one in the code), 100000 iterations:

[set] Elapsed: 1.3900001049
[sort] Elapsed: 0.891000032425
[groupby] Elapsed: 0.780999898911
[loop in] Elapsed: 0.578000068665

For the longer list (the one in the code, duplicated 5 times):

[set] Elapsed: 3.68700003624
[sort] Elapsed: 3.43799996376
[groupby] Elapsed: 1.03099989891
[loop in] Elapsed: 1.85900020599
>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> import itertools
>>> k.sort()
>>> list(k for k,_ in itertools.groupby(k))
[[1, 2], [3], [4], [5, 6, 2]]

itertools often offers the fastest and most powerful solutions to this kind of problem, and is well worth getting intimately familiar with!-)

Edit: as I mention in a comment, normal optimization efforts are focused on large inputs (the big-O approach) because it's so much easier that it offers good returns on efforts. But sometimes (essentially for "tragically crucial bottlenecks" in deep inner loops of code that's pushing the boundaries of performance limits) one may need to go into much more detail, providing probability distributions, deciding which performance measures to optimize (maybe the upper bound or the 90th centile is more important than an average or median, depending on one's apps), performing possibly-heuristic checks at the start to pick different algorithms depending on input data characteristics, and so forth.

Careful measurements of "point" performance (code A vs code B for a specific input) are a part of this extremely costly process, and standard library module timeit helps here. However, it's easier to use it at a shell prompt. For example, here's a short module to showcase the general approach for this problem; save it as nodup.py:

import itertools

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

def doset(k, map=map, list=list, set=set, tuple=tuple):
  return map(list, set(map(tuple, k)))

def dosort(k, sorted=sorted, xrange=xrange, len=len):
  ks = sorted(k)
  return [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]

def dogroupby(k, sorted=sorted, groupby=itertools.groupby, list=list):
  ks = sorted(k)
  return [i for i, _ in groupby(ks)]

def donewk(k):
  newk = []
  for i in k:
    if i not in newk:
      newk.append(i)
  return newk

# sanity check that all functions compute the same result and don't alter k
if __name__ == '__main__':
  savek = list(k)
  for f in doset, dosort, dogroupby, donewk:
    resk = f(k)
    assert k == savek
    print '%10s %s' % (f.__name__, sorted(resk))

Note the sanity check (performed when you just do python nodup.py) and the basic hoisting technique (make constant global names local to each function for speed) to put things on equal footing.

Now we can run checks on the tiny example list:

$ python -mtimeit -s'import nodup' 'nodup.doset(nodup.k)'
100000 loops, best of 3: 11.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort(nodup.k)'
100000 loops, best of 3: 9.68 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby(nodup.k)'
100000 loops, best of 3: 8.74 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.donewk(nodup.k)'
100000 loops, best of 3: 4.44 usec per loop

confirming that the quadratic approach has small-enough constants to make it attractive for tiny lists with few duplicated values. With a short list without duplicates:

$ python -mtimeit -s'import nodup' 'nodup.donewk([[i] for i in range(12)])'
10000 loops, best of 3: 25.4 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby([[i] for i in range(12)])'
10000 loops, best of 3: 23.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.doset([[i] for i in range(12)])'
10000 loops, best of 3: 31.3 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort([[i] for i in range(12)])'
10000 loops, best of 3: 25 usec per loop

the quadratic approach isn't bad, but the sort and groupby ones are better. Etc, etc.

If (as the obsession with performance suggests) this operation is at a core inner loop of your pushing-the-boundaries application, it's worth trying the same set of tests on other representative input samples, possibly detecting some simple measure that could heuristically let you pick one or the other approach (but the measure must be fast, of course).
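
For instance, a purely illustrative dispatcher might branch on a cheap measure such as len(k); the cutoff below is a made-up placeholder and would have to be tuned on your real data:

def dedup(k, small_cutoff=30):
    # hypothetical cutoff: below it, the quadratic scan tends to win on constants
    if len(k) <= small_cutoff:
        newk = []
        for item in k:
            if item not in newk:
                newk.append(item)
        return newk
    # larger inputs: fall back to the hashing (tuple/set) approach
    return [list(t) for t in set(map(tuple, k))]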

It's also well worth considering keeping a different representation for k -- why does it have to be a list of lists rather than a set of tuples in the first place? If the duplicate removal task is frequent, and profiling shows it to be the program's performance bottleneck, keeping a set of tuples all the time and getting a list of lists from it only if and where needed might be faster overall, for example.
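
A minimal sketch of that alternative representation (the names here are illustrative, not part of the original suggestion):

unique = set()                 # canonical storage: hashable tuples

def add(item):
    unique.add(tuple(item))

def as_list_of_lists():
    # materialize the list-of-lists view only when it is actually needed
    return [list(t) for t in unique]

add([1, 2])
add([4])
add([1, 2])                    # duplicate, silently absorbed by the set
print(as_list_of_lists())      # e.g. [[4], [1, 2]] (order not guaranteed)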

Doing it manually, creating a new k list and adding entries not found so far:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
new_k = []
for elem in k:
    if elem not in new_k:
        new_k.append(elem)
k = new_k
print k
# prints [[1, 2], [4], [5, 6, 2], [3]]

Simple to comprehend, and you preserve the order of the first occurrence of each element should that be useful, but I guess it's quadratic in complexity as you're searching the whole of new_k for each element.

>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> k = sorted(k)
>>> k
[[1, 2], [1, 2], [3], [4], [4], [5, 6, 2]]
>>> dedup = [k[i] for i in range(len(k)) if i == 0 or k[i] != k[i-1]]
>>> dedup
[[1, 2], [3], [4], [5, 6, 2]]

I don't know if it's necessarily faster, but you don't have to use tuples and sets.

Tuples and a set comprehension ({}) can be used to remove duplicates:

>>> [list(tupl) for tupl in {tuple(item) for item in k }]
[[1, 2], [5, 6, 2], [3], [4]]

Even your "long" list is pretty short.甚至您的“长”列表也很短。 Also, did you choose them to match the actual data?另外,您是否选择它们以匹配实际数据? Performance will vary with what these data actually look like.性能会因这些数据的实际外观而异。 For example, you have a short list repeated over and over to make a longer list.例如,您有一个简短的列表一遍又一遍地重复以形成一个更长的列表。 This means that the quadratic solution is linear in your benchmarks, but not in reality.这意味着二次解在您的基准测试中是线性的,但实际上并非如此。

For actually-large lists, the set code is your best bet: it's linear (although space-hungry). The sort and groupby methods are O(n log n) and the loop-in method is obviously quadratic, so you know how these will scale as n gets really big. If this is the real size of the data you are analyzing, then who cares? It's tiny.
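
To see the asymptotic behaviour rather than an artifact of the repeated data, one could time the candidates on genuinely random input; a rough sketch (the sizes and value ranges below are arbitrary choices, not from the question):

import itertools
import random
import timeit

def random_input(n, max_len=4, max_val=50):
    # random list of short lists; duplicates occur by chance, not by construction
    return [[random.randint(0, max_val) for _ in range(random.randint(1, max_len))]
            for _ in range(n)]

def via_set(k):
    return [list(t) for t in set(map(tuple, k))]

def via_groupby(k):
    return [g for g, _ in itertools.groupby(sorted(k))]

k_big = random_input(10000)
print(timeit.timeit(lambda: via_set(k_big), number=10))
print(timeit.timeit(lambda: via_groupby(k_big), number=10))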

Incidentally, I'm seeing a noticeable speedup if I don't form an intermediate list to make the set, that is to say if I replace

kt = [tuple(i) for i in k]
skt = set(kt)

with

skt = set(tuple(i) for i in k)

The real solution may depend on more information: Are you sure that a list of lists is really the representation you need?

All the set-related solutions to this problem thus far require creating an entire set before iteration.

It is possible to make this lazy, and at the same time preserve order, by iterating the list of lists and adding to a "seen" set. Then only yield a list if it is not found in this tracker set.

This unique_everseen recipe is available in the itertools docs. It's also available in the 3rd party toolz library:

from toolz import unique

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

# lazy iterator
res = map(list, unique(map(tuple, k)))

print(list(res))

[[1, 2], [4], [5, 6, 2], [3]]

Note that tuple conversion is necessary because lists are not hashable.
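
If pulling in toolz is not desirable, a minimal stdlib-only generator in the spirit of that recipe might look like this (a sketch, not the exact code from the itertools docs):

def unique_lists(iterable):
    # lazily yield each sub-list the first time its tuple form is seen
    seen = set()
    for item in iterable:
        key = tuple(item)
        if key not in seen:
            seen.add(key)
            yield item

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
print(list(unique_lists(k)))  # [[1, 2], [4], [5, 6, 2], [3]]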

a_list = [
          [1,2],
          [1,2],
          [2,3],
          [3,4]
]

print (list(map(list,set(map(tuple,a_list)))))

outputs: [[1, 2], [3, 4], [2, 3]]

Create a dictionary with tuples as the keys, and print the keys.

  • create a dictionary with the tuple as key and the index as value
  • print the list of the dictionary's keys

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

dict_tuple = {tuple(item): index for index, item in enumerate(k)}

print [list(itm) for itm in dict_tuple.keys()]

# prints [[1, 2], [5, 6, 2], [3], [4]]

This should work (note that it compares sub-lists as sets, so element order and repeated values inside a sub-list are ignored).

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

k_cleaned = []
for ele in k:
    if set(ele) not in [set(x) for x in k_cleaned]:
        k_cleaned.append(ele)
print(k_cleaned)

# output: [[1, 2], [4], [5, 6, 2], [3]]

Strangely, the answers above remove the 'duplicates', but what if I want to remove the duplicated values as well, i.e. drop every entry whose key occurs more than once? The following should be useful, and it works mostly in place, marking duplicates and filtering them out at the end:

a = [[1, 'somevalue1'], [1, 'somevalue2'], [2, 'somevalue1'], [3, 'somevalue4'],
     [5, 'somevalue5'], [5, 'somevalue1'], [5, 'somevalue1'], [5, 'somevalue8'],
     [6, 'somevalue9'], [6, 'somevalue0'], [6, 'somevalue1'], [7, 'somevalue7']]

print(a)
temp = 0
position = -1
for pageNo, item in a:
    position += 1
    if pageNo != temp:
        temp = pageNo
        continue
    else:
        # pageNo repeats the previous key: mark this entry and the one before it
        a[position] = 0
        a[position - 1] = 0
a = [x for x in a if x != 0]
print(a)

and the output is:

[[1, 'somevalue1'], [1, 'somevalue2'], [2, 'somevalue1'], [3, 'somevalue4'], [5, 'somevalue5'], [5, 'somevalue1'], [5, 'somevalue1'], [5, 'somevalue8'], [6, 'somevalue9'], [6, 'somevalue0'], [6, 'somevalue1'], [7, 'somevalue7']]
[[2, 'somevalue1'], [3, 'somevalue4'], [7, 'somevalue7']]

A bit of background: I just started with Python and learnt comprehensions.

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
# join each sub-list into a dot-separated string, dedupe via a set, then split back
# into ints (this assumes the elements are plain integers)
dedup = [[int(s) for s in elem.split('.')] for elem in set('.'.join(str(int_elem) for int_elem in _list) for _list in k)]

The simplest solution is to convert the list of lists into a list of tuples, apply the dict.fromkeys() method, and then convert it back to a list.

For example:

You have k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

Convert it to a list of tuples: k = list(map(tuple, k))

This will give you [(1, 2), (4,), (5, 6, 2), (1, 2), (3,), (4,)]

Then do the following: unique = list(dict.fromkeys(k))

You will have [(1, 2), (4,), (5, 6, 2), (3,)]

That's all.
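
Put together as one runnable snippet, with a final conversion back to lists (dict preserves insertion order on Python 3.7+, so the first occurrence of each sub-list survives):

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

unique = list(dict.fromkeys(map(tuple, k)))   # dedupe, preserving first-seen order
k = [list(t) for t in unique]                 # back to lists, if tuples won't do

print(k)  # [[1, 2], [4], [5, 6, 2], [3]]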

If the complaint is not with 'not fast' per se but with the 'not concise enough' part of your proposed solution, then in Python 3.5+, with the help of the unpacking operator and concise set and tuple notation, you can make the chained data structure conversions extremely brief (this is the same hashing approach under the hood, but unpacking is slightly faster than the direct constructor calls):

Input:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
k = [*map(list, {*map(tuple, k)})]

# If you prefer comprehensions to map()
# k = [[*t] for t in {(*l,) for l in k}]

# Order-preserving alternative:
# k = [*map(list, dict.fromkeys(map(tuple, k)))]

print(k)

Output:

[[1, 2], [4], [5, 6, 2], [3]]

Another, probably more generic and simpler, solution is to create a dictionary keyed by the string version of the objects, getting the values() at the end:

>>> dict([(unicode(a),a) for a in [["A", "A"], ["A", "A"], ["A", "B"]]]).values()
[['A', 'B'], ['A', 'A']]

The catch is that this only works for objects whose string representation is a good-enough unique key (which is true for most native objects).
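
On Python 3, where unicode no longer exists, the same idea might look like this; str() is used for the key, which is an assumption that holds for plain lists of built-ins:

k = [["A", "A"], ["A", "A"], ["A", "B"]]

# key each sub-list by its string form; later duplicates overwrite earlier ones
deduped = list({str(a): a for a in k}.values())

print(deduped)  # [['A', 'A'], ['A', 'B']] on Python 3.7+ (insertion order)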

k=[[1, 2], [4], [5, 6, 2], [1, 2], [3], [5, 2], [3], [8], [9]]
kl=[]
kl.extend(x for x in k if x not in kl)
k=list(kl)
print(k)

which prints:

[[1, 2], [4], [5, 6, 2], [3], [5, 2], [8], [9]]
