简体   繁体   中英

Removing duplicates from a list of lists based on a comparison of an element of the inner lists

I have a large list of lists and need to remove duplicate elements based on specific criteria:

  1. Uniqueness is determined by the first element of the lists.
  2. Removal of duplicates is determined by comparing the value of the second element of the duplicate lists, namely keep the list with the lowest second element.

[[1, 4, 5], [1, 3, 4], [1, 2, 3]]

All the above lists are considered duplicates since their first elements are equal. The third list needs to be kept since it's second element is the smallest. Note the actual list of lists has over 4 million elements, is double sorted and ordering needs to be preserved.

The list is first sorted based on the second element of the inner lists and in reverse (descending) order, followed by normal (ascending) order based on the first element:

sorted(sorted(the_list, key=itemgetter(1), reverse=True), key=itemgetter(0))

An example of three duplicate lists in their actual ordering:

[...
[33554432, 50331647, 1695008306],
[33554432, 34603007, 1904606324],
[33554432, 33554687, 2208089473],
...]

The goal is to prepare the list for bisect searching. Can someone provide me with insight on how this might be achieved using Python?

You can group the elements using a dict, always keeping the sublist with the smaller second element:

l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = {}
for sub in l:
    k = sub[0]
    if k not in d or sub[1] < d[k][1]:
        d[k] = sub

Also you can pass two keys to sorted, you don't need to call sorted twice:

In [3]:  l = [[1,4,6,2],[2,2,4,6],[1,2,4,5]]
In [4]: sorted(l,key=lambda x: (-x[1],x[0]))
Out[4]: [[1, 4, 6, 2], [1, 2, 4, 5], [2, 2, 4, 6]]

If you wanted to maintain order in the dict as per ordering needs to be preserved. :

from collections import OrderedDict

l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = OrderedDict()
for sub in l:
    k = sub[0]
    if k not in d or sub[1] < d[k][1]:
        d[sub[0]] = sub

But not sure how that fits as you are sorting the data after so you will lose any order.

What you may find very useful is a sortedcontainers.sorteddict :

A SortedDict provides the same methods as a dict. Additionally, a SortedDict efficiently maintains its keys in sorted order. Consequently, the keys method will return the keys in sorted order, the popitem method will remove the item with the highest key, etc.

An optional key argument defines a callable that, like the key argument to Python's sorted function, extracts a comparison key from each dict key. If no function is specified, the default compares the dict keys directly. The key argument must be provided as a positional argument and must come before all other arguments.

from sortedcontainers import SortedDict

l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = SortedDict()
for sub in l:
    k = sub[0]
    if k not in d or sub[1] < d[k][1]:
        d[k] = sub


print(list(d.values()))

It has all the methods you want bisect , bisect_left etc..

If I got it correctly, the solution might be like this:

mylist = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [7, 3, 6], [7, 1, 8]]

ordering = []
newdata = {}

for a, b, c in mylist:
    if a in newdata:
        if b < newdata[a][1]:
            newdata[a] = [a, b, c]
    else:
        newdata[a] = [a, b, c]
        ordering.append(a)

newlist = [newdata[v] for v in ordering]

So in newlist we will receive reduced list of [[1, 2, 3], [7, 1, 8]] .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM