Quickly remove consecutive duplicates from a list and the corresponding items from another list

Question

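My question is similar to a previous SO question. I have two very large lists of data (almost 20 million data points each) that contain many consecutive duplicates. I want to remove the consecutive duplicates, like this: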

list1 = [1,1,1,1,1,1,2,3,4,4,5,1,2]  # This is 20M long!
list2 = ...  # another list of size len(list1), also 20M long!
i = 0
while i < len(list1)-1:
    if list1[i] == list1[i+1]:
        del list1[i]
        del list2[i]
    else:
        i = i+1

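The output for the first list should be [1, 2, 3, 4, 5, 1, 2]. Unfortunately, this is very slow, because deleting an element from a list is itself a slow operation (each del has to shift all the elements after it). Is there any way to speed this up? Note that, as the code snippet above shows, I also need to keep track of the index i so that I can delete the corresponding element in list2.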

Answer 1

Python gives you this in the standard library, as itertools.groupby:

>>> list1 = [1,1,1,1,1,1,2,3,4,4,5,1,2]
>>> from itertools import groupby
>>> [k for k,_ in groupby(list1)]
[1, 2, 3, 4, 5, 1, 2]

You can adapt it with a keyfunc argument to process the second list at the same time.

>>> list1 = [1,1,1,1,1,1,2,3,4,4,5,1,2]
>>> list2 = [9,9,9,8,8,8,7,7,7,6,6,6,5]
>>> from operator import itemgetter
>>> keyfunc = itemgetter(0)
>>> [next(g) for k,g in groupby(zip(list1, list2), keyfunc)]
[(1, 9), (2, 7), (3, 7), (4, 7), (5, 6), (1, 6), (2, 5)]

If you want to split those pairs back into separate sequences again:

>>> list(zip(*_))  # "unzip" them; list() is needed on Python 3, where zip is lazy
[(1, 2, 3, 4, 5, 1, 2), (9, 7, 7, 7, 6, 6, 5)]
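Put together for Python 3, the steps above could be wrapped up roughly like this. This is only a sketch: the name dedupe_pair is made up for illustration (it is not part of itertools), and it assumes both lists have the same length.

```python
from itertools import groupby
from operator import itemgetter


def dedupe_pair(list1, list2):
    """Drop consecutive duplicates from list1 and the matching items from list2.

    Hypothetical helper for illustration. Keeps the first pair of each run.
    """
    # Group (a, b) pairs by the list1 value and take the first pair of each run.
    firsts = [next(group)
              for _, group in groupby(zip(list1, list2), key=itemgetter(0))]
    if not firsts:
        return [], []
    # "Unzip" the pairs back into two separate lists.
    out1, out2 = zip(*firsts)
    return list(out1), list(out2)


list1 = [1, 1, 1, 1, 1, 1, 2, 3, 4, 4, 5, 1, 2]
list2 = [9, 9, 9, 8, 8, 8, 7, 7, 7, 6, 6, 6, 5]
new1, new2 = dedupe_pair(list1, list2)
print(new1)  # [1, 2, 3, 4, 5, 1, 2]
print(new2)  # [9, 7, 7, 7, 6, 6, 5]
```

One thing to watch: this keeps the first element of each run, whereas the del loop in the question deletes the earlier element and so keeps the last one. For list1 that makes no difference, but for list2 it can, so pick whichever you actually need.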

Answer 2

You can use collections.deque with its maxlen argument to set a window size of 2. Then just compare the two entries in the window, and append to the result if they differ.

from collections import deque

def remove_adj_dups(x):
    """
    :param x: values such as 1, 1, 2, 3, 3
        from an iterable such as a string, a list, or a generator
    :return: the values with adjacent duplicates removed, e.g. [1, 2, 3], as a list
    """
    result = []
    # The 1st entry is object(), which only compares equal to itself.
    # Kudos to Trey Hunner for the object() sentinel.
    d = deque([object()], maxlen=2)

    for i in x:
        d.append(i)
        a, b = d
        if a != b:
            result.append(b)
    return result
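The function above only handles a single sequence, while the question also needs the matching items dropped from a second list. A possible two-list variant, as a rough sketch: remove_adj_dups_paired is a hypothetical name, and it assumes both inputs have the same length.

```python
from collections import deque


def remove_adj_dups_paired(xs, ys):
    """Hypothetical two-list variant of the sliding-window approach.

    Only neighbours in xs are compared; the first (x, y) of each run is kept.
    """
    out_x, out_y = [], []
    # Sentinel object() compares unequal to any real item in xs.
    window = deque([object()], maxlen=2)
    for x, y in zip(xs, ys):
        window.append(x)
        prev, cur = window
        if prev != cur:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


list1 = [1, 1, 1, 1, 1, 1, 2, 3, 4, 4, 5, 1, 2]
list2 = [9, 9, 9, 8, 8, 8, 7, 7, 7, 6, 6, 6, 5]
kept1, kept2 = remove_adj_dups_paired(list1, list2)
print(kept1)  # [1, 2, 3, 4, 5, 1, 2]
print(kept2)  # [9, 7, 7, 7, 6, 6, 5]
```

Because zip stops at the shorter input, a length mismatch would be silently truncated here; add a length check first if that matters for your data.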

I generated a random sequence of 2 million integers from 0 to 9, which contains plenty of adjacent duplicates.

def random_nums_with_dups(number_range=None, range_len=None):
    """
    :param number_range: use the integers from 0 up to (not including) number_range. The smaller this is, the more dups
    :param range_len: number of values the generator yields
    :return: a generator

    Note: If number_range = 2, then random binary is returned
    """

    import random
    return (random.choice(range(number_range)) for i in range(range_len))

Then I tested it:

range_len = 2000000

def mytest():
    # Feed the generator straight in, so the full input is never stored as a list.
    return remove_adj_dups(random_nums_with_dups(number_range=10, range_len=range_len))

big_result = mytest()
print(len(big_result))

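The resulting length was 1800197 (with adjacent duplicates removed), in under 5 seconds, and that includes spinning up the random list generator. I lack the experience/know-how to judge whether it is also memory efficient. Could someone comment?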
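On the memory question, one way to get a rough number is the standard-library tracemalloc module. The sketch below is only an illustration: peak_bytes and dedupe are hypothetical helper names, and the test data is a small made-up generator rather than the 2 million item run above.

```python
import tracemalloc
from itertools import groupby


def peak_bytes(func, *args):
    """Call func(*args) and return (result, peak bytes traced during the call)."""
    tracemalloc.start()
    try:
        result = func(*args)
        # get_traced_memory() returns a (current, peak) tuple in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def dedupe(seq):
    """Stand-in for whichever dedup function you want to measure."""
    return [k for k, _ in groupby(seq)]


# Made-up input: runs of three equal values, fed lazily as a generator.
data = (i // 3 for i in range(300_000))
result, peak = peak_bytes(dedupe, data)
print(len(result), "items kept, peak of about", peak // 1024, "KiB traced")
```

tracemalloc slows the code down noticeably, so measure time and memory in separate runs. When the input is a generator, as in the test above, the input itself is never held in memory, so the result list should account for most of the peak.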

Answer 1
Score 9, accepted, 2017-01-06 18:01:53

Answer 2
Score 0, 2020-10-01 07:28:13
