
How to remove duplicates of list of lists based on element in nested list

I want to remove duplicates from a list of lists. The first element is not always unique for every second element in the nested list. The first value, however, is unique across the whole list of lists: each number occurs only once, but the numbers are not ordered.

my_list = [[4, 'C'], [1, 'A'], [3, 'B'], [2, 'A'], [5, 'C']]

Removing the duplicates is based on the second element in the nested list. I need the minimum value for each unique second element, like:

my_unique_list = [[1, 'A'], [3, 'B'], [4, 'C']]

It doesn't matter what order the output is in.

So, pick 1 for 'A' (as 1 is lower than 2 from [2, 'A']), 3 for 'B' (there are no other values for 'B'), and 4 for 'C' (as 4 is lower than 5, from [5, 'C']).

Use a dictionary to map unique letters (second values) to the minimum value for each letter, then simply take the [value, key] pairs from that dictionary as your output:

minimi = {}
inf = float('inf')
for val, key in my_list:
    # float('inf') as default value is always larger, so val is picked
    # if the key isn't present yet.
    minimi[key] = min(val, minimi.get(key, inf))

my_unique_list = [[v, k] for k, v in minimi.items()]

By using a dictionary as an intermediary, you can filter the input in linear time.
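An equivalent sketch of the same one-pass idea, checking dictionary membership explicitly instead of using an infinity sentinel (a variation of mine, not taken from the answer above):

```python
my_list = [[4, 'C'], [1, 'A'], [3, 'B'], [2, 'A'], [5, 'C']]

# Keep the smallest value seen so far for each letter.
minimi = {}
for val, key in my_list:
    if key not in minimi or val < minimi[key]:
        minimi[key] = val

my_unique_list = [[v, k] for k, v in minimi.items()]
# Order follows first occurrence of each letter: [[4, 'C'], [1, 'A'], [3, 'B']]
```

Both forms do one dictionary lookup and at most one store per input element, so the running time stays linear either way.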

Demo:

>>> my_list = [[4, 'C'], [1, 'A'], [3, 'B'], [2, 'A'], [5,'C']]
>>> minimi, inf = {}, float('inf')
>>> for val, key in my_list:
...     minimi[key] = min(val, minimi.get(key, inf))
...
>>> minimi
{'C': 4, 'A': 1, 'B': 3}
>>> my_unique_list = [[v, k] for k, v in minimi.items()]
>>> my_unique_list
[[4, 'C'], [1, 'A'], [3, 'B']]

Why should you care about running time? Because as your input grows, so does your running time. For approaches that take O(N^2) (quadratic) time, going from 1000 items to 1 million (so 1000 times the size) increases the running time by 1 million times! For O(N logN) approaches (those that use sorting), the running time would increase by roughly 2000 times, while a linear approach as above would take just 1000 times as long, scaling linearly as your inputs scale.

For large inputs, that can make the difference between 'takes an hour or two' and 'takes millions of years'.

Here is a time-trial comparison between this approach and zamir's sorting-and-set approach (O(N logN)) as well as TJC World's Pandas approach (also O(N logN)):

from string import ascii_uppercase
from functools import partial
from timeit import Timer
import random
import pandas as pd

def gen_data(N):
    return [[random.randrange(1_000_000), random.choice(ascii_uppercase)] for _ in range(N)]

def with_dict(l, _min=min, _inf=float('inf')):
    minimi = {}
    m_get = minimi.get
    for val, key in l:
        minimi[key] = _min(val, m_get(key, _inf))
    return [[v, k] for k, v in minimi.items()]

def with_set_and_sort(l):
    already_encountered = set()
    ae_add = already_encountered.add
    return [i for i in sorted(l) if i[1] not in already_encountered and not ae_add(i[1])]

def with_pandas(l):
    return (
        pd.DataFrame(l)
        .sort_values(by=0)
        .drop_duplicates(1)
        .to_numpy()
        .tolist()
    )

for n in (100, 1000, 10_000, 100_000, 1_000_000):
    testdata = gen_data(n)
    print(f"{n:,} entries:")
    for test in (with_dict, with_set_and_sort, with_pandas):
        count, total = Timer(partial(test, testdata)).autorange()
        print(f"{test.__name__:>20}: {total/count * 1000:8.3f}ms")
    print()

I've used all the little performance tricks I know of in there: avoiding repeated lookups of globals and attributes by caching them in local names outside of the loops.

This outputs:

100 entries:
           with_dict:    0.028ms
   with_set_and_sort:    0.032ms
         with_pandas:    2.070ms

1,000 entries:
           with_dict:    0.242ms
   with_set_and_sort:    0.369ms
         with_pandas:    2.312ms

10,000 entries:
           with_dict:    2.331ms
   with_set_and_sort:    5.639ms
         with_pandas:    5.476ms

100,000 entries:
           with_dict:   23.105ms
   with_set_and_sort:  127.772ms
         with_pandas:   40.330ms

1,000,000 entries:
           with_dict:  245.982ms
   with_set_and_sort: 2494.305ms
         with_pandas:  578.952ms

So, with only 100 inputs, the sorting approach may appear to be about as fast (a few milliseconds' difference either way), but as the inputs grow, that approach loses ground at an accelerating pace.

Pandas loses out on all fronts here. Dataframes are a great tool, but the wrong tool for this job. They are hefty data structures, so for small inputs their high overhead puts them into the millisecond range, way behind the other two options. At 10k entries Pandas starts to beat the sorting-and-set approach, but even though dataframe operations are highly optimised, the growth of sorting runtime with larger inputs still can't beat a linear approach.

already_encountered = set()
my_new_list = [i for i in sorted(my_list)
               if i[1] not in already_encountered
               and not already_encountered.add(i[1])]

Output:

[[1, 'A'], [3, 'B'], [4, 'C']]
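A small variant of the above (my own tweak, not part of the original answer): passing key=lambda x: x[0] to sorted() compares only the numeric values, so the letters are never compared during the sort. The result is the same, since the numbers are unique:

```python
my_list = [[4, 'C'], [1, 'A'], [3, 'B'], [2, 'A'], [5, 'C']]

already_encountered = set()
seen_add = already_encountered.add
# Sort on the numeric value only, then keep the first (smallest) entry per letter.
my_new_list = [i for i in sorted(my_list, key=lambda x: x[0])
               if i[1] not in already_encountered and not seen_add(i[1])]
# [[1, 'A'], [3, 'B'], [4, 'C']]
```

The `not seen_add(...)` trick works because set.add() returns None (falsy), so it marks the letter as seen while keeping the element.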

Using pandas:

>>> import pandas as pd
>>> my_list = [[1, 'A'], [2, 'A'], [3, 'B'], [4, 'C'], [5,'C']]
>>> df = pd.DataFrame(my_list)
>>> df.sort_values(by = 0).drop_duplicates(1).to_numpy().tolist()
[[1, 'A'], [3, 'B'], [4, 'C']]
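A groupby-based sketch along the same lines (my variation, not the original answer) computes the per-letter minimum directly instead of sorting and dropping duplicates; the column names here are my own labels:

```python
import pandas as pd

my_list = [[1, 'A'], [2, 'A'], [3, 'B'], [4, 'C'], [5, 'C']]
df = pd.DataFrame(my_list, columns=['value', 'letter'])

# Minimum value within each letter group; groups come out sorted by letter.
result = (
    df.groupby('letter', as_index=False)['value']
    .min()[['value', 'letter']]
    .to_numpy()
    .tolist()
)
# [[1, 'A'], [3, 'B'], [4, 'C']]
```

This avoids the full sort of every row, though for a problem this small the pandas overhead discussed above still dominates.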
