
How to remove duplicates of list of lists based on element in nested list

I want to remove duplicates from a list of lists. The second element of each nested list is not unique: several nested lists can share the same letter. The first element, however, is unique across the whole list of lists; each number occurs only once, but the numbers are not ordered.

my_list = [[4, 'C'], [1, 'A'], [3, 'B'], [2, 'A'], [5, 'C']]

Removing the duplicates is based on the second element in the nested list. I need the minimum value for each unique second element, like:

my_unique_list = [[1, 'A'], [3, 'B'], [4, 'C']]

It doesn't matter what order the output is in.

So, pick 1 for 'A' (as 1 is lower than 2 from [2, 'A']), 3 for 'B' (there are no other values for 'B'), and 4 for 'C' (as 4 is lower than 5 from [5, 'C']).

Use a dictionary to map unique letters (second values) to the minimum value for each letter, then simply take the [value, key] pairs from that dictionary as your output:

minimi = {}
inf = float('inf')
for val, key in my_list:
    # float('inf') as default value is always larger, so val is picked
    # if the key isn't present yet.
    minimi[key] = min(val, minimi.get(key, inf))

my_unique_list = [[v, k] for k, v in minimi.items()]

By using a dictionary as an intermediary, you can filter the input in linear time.

Demo:

>>> my_list = [[4, 'C'], [1, 'A'], [3, 'B'], [2, 'A'], [5,'C']]
>>> minimi, inf = {}, float('inf')
>>> for val, key in my_list:
...     minimi[key] = min(val, minimi.get(key, inf))
...
>>> minimi
{'C': 4, 'A': 1, 'B': 3}
>>> my_unique_list = [[v, k] for k, v in minimi.items()]
>>> my_unique_list
[[4, 'C'], [1, 'A'], [3, 'B']]

Why should you care about running time? Because as your input grows, so does your running time. For approaches that take O(N^2) (quadratic) time, going from 1,000 items to 1 million (1,000 times the size) makes the running time 1 million times longer! For O(N log N) approaches (those that use sorting), the running time increases by a factor of roughly 2,000, while a linear approach like the one above takes just 1,000 times as long, scaling in step with the input.

For large inputs, that can make the difference between 'takes an hour or two' to 'takes millions of years'.
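
The arithmetic behind those factors is easy to check. Here is a quick sketch (not part of the benchmark below) that simply evaluates the growth ratios for an input going from 1,000 to 1,000,000 items, with constant factors cancelling out:

from math import log2

n_small, n_large = 1_000, 1_000_000

# How much longer each approach takes when the input grows 1,000-fold
# (constant factors cancel out in the ratio).
quadratic = (n_large ** 2) / (n_small ** 2)                       # 1,000,000x
n_log_n = (n_large * log2(n_large)) / (n_small * log2(n_small))   # ~2,000x
linear = n_large / n_small                                        # 1,000x

print(f"O(N^2):     {quadratic:>12,.0f}x slower")
print(f"O(N log N): {n_log_n:>12,.0f}x slower")
print(f"O(N):       {linear:>12,.0f}x slower")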

Here is a time-trial comparison between this approach, zamir's sorting-and-set approach (O(N log N)), and TJC World's Pandas approach (also O(N log N)):

from string import ascii_uppercase
from functools import partial
from timeit import Timer
import random
import pandas as pd

def gen_data(N):
    return [[random.randrange(1_000_000), random.choice(ascii_uppercase)] for _ in range(N)]

def with_dict(l, _min=min, _inf=float('inf')):
    minimi = {}
    m_get = minimi.get
    for val, key in l:
        minimi[key] = _min(val, m_get(key, _inf))
    return [[v, k] for k, v in minimi.items()]

def with_set_and_sort(l):
    already_encountered = set()
    ae_add = already_encountered.add
    return [i for i in sorted(l) if i[1] not in already_encountered and not ae_add(i[1])]

def with_pandas(l):
    return (
        pd.DataFrame(l)
        .sort_values(by=0)
        .drop_duplicates(1)
        .to_numpy()
        .tolist()
    )

for n in (100, 1000, 10_000, 100_000, 1_000_000):
    testdata = gen_data(n)
    print(f"{n:,} entries:")
    for test in (with_dict, with_set_and_sort, with_pandas):
        count, total = Timer(partial(test, testdata)).autorange()
        print(f"{test.__name__:>20}: {total/count * 1000:8.3f}ms")
    print()

I've used all the little performance tricks in there that I know of: avoiding repeated lookups of globals and attributes by caching them in local names outside of the loops.
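
To see that caching trick in isolation, here is a minimal sketch (not part of the timed code above) comparing a loop that looks up the builtin min by name on every iteration against one that first binds it to a local name; the exact numbers depend on your interpreter and machine:

from timeit import timeit

setup = "import random; data = [random.random() for _ in range(10_000)]"

global_lookup = """
lowest = float('inf')
for v in data:
    lowest = min(lowest, v)   # the name `min` is looked up on every iteration
"""

local_lookup = """
_min = min                    # cache the builtin in a local name once
lowest = float('inf')
for v in data:
    lowest = _min(lowest, v)  # local-name lookups are cheaper
"""

print("global lookup:", timeit(global_lookup, setup=setup, number=200))
print("local lookup: ", timeit(local_lookup, setup=setup, number=200))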

The benchmark script above outputs:

100 entries:
           with_dict:    0.028ms
   with_set_and_sort:    0.032ms
         with_pandas:    2.070ms

1,000 entries:
           with_dict:    0.242ms
   with_set_and_sort:    0.369ms
         with_pandas:    2.312ms

10,000 entries:
           with_dict:    2.331ms
   with_set_and_sort:    5.639ms
         with_pandas:    5.476ms

100,000 entries:
           with_dict:   23.105ms
   with_set_and_sort:  127.772ms
         with_pandas:   40.330ms

1,000,000 entries:
           with_dict:  245.982ms
   with_set_and_sort: 2494.305ms
         with_pandas:  578.952ms

So, with only 100 inputs, the sorting approach appears to be just as fast (the difference is a few hundredths of a millisecond either way), but as the inputs grow, that approach loses ground at an accelerating pace.

Pandas loses out on all fronts here. Dataframes are a great tool, but the wrong tool for this job. They are hefty data structures, so for small inputs their high overhead puts them in the millisecond range, way behind the other two options. At 10k entries Pandas starts to beat the sorting-and-set approach, but even though dataframe operations are highly optimised, the growth of sorting time with larger inputs still can't beat a linear approach.

Alternatively, use the sorting-and-set approach: sort the list so the lowest value for each letter comes first, then keep only the first entry seen for each letter, tracking letters already seen in a set:

already_encountered = set()
# set.add() returns None, so `not ... .add(...)` records the letter while keeping the filter truthy.
my_new_list = [i for i in sorted(my_list) if i[1] not in already_encountered and not already_encountered.add(i[1])]

Output:

[[1, 'A'], [3, 'B'], [4, 'C']]

Using pandas:

>>> import pandas as pd
>>> my_list = [[1, 'A'], [2, 'A'], [3, 'B'], [4, 'C'], [5,'C']]
>>> df = pd.DataFrame(my_list)
>>> df.sort_values(by = 0).drop_duplicates(1).to_numpy().tolist()
[[1, 'A'], [3, 'B'], [4, 'C']]
