
comparing a list of lists with itself

I have somewhat resolved this issue and I'm just trying to figure out a more efficient way of doing it. I have a big list of lists, and I am trying to compare each list in the big list with every other list.

How can I avoid duplicate comparisons, i.e. comparing pairs of lists that have already been compared?

Ex: big_list[0] has been compared with big_list[20], so there is no reason to compare big_list[20] with big_list[0] later on in the loop.

        big_list= [[0.12, 0.939, -0.321, 6.342], [0.12, 0.939, -0.321,6.342], [0.0, 1.0, -0.0, -5.166], [0.0, 1.0, 0.0, -5.166], [0.0, 1.0, -0.0, -5.166], [-0.0, 1.0, 0.0, -5.166], [0.0, 1.0, 0.0, -5.166], [0.0, 1.0, 0.0, -5.166], [0.0,1.0, -0.0, -5.166], [0.0, 1.0, 0.0, -5.166], [-0.0, 1.0, -0.0, -5.166], [-0.0, 1.0, 0.0, -5.166], [-0.12, 0.939, 0.321, 0.282], [-0.12, 0.939, 0.321, 0.282], [0.12, 0.939, 0.321, -17.782], [0.12, 0.939, 0.321, -17.782], [-0.0, 1.0, 0.0, 0.834], [0.0, 1.0, 0.0, 0.834], [0.0, 1.0, 0.0, 0.834], [0.0, 1.0, 0.0, 0.834], [-0.12, 0.939, -0.321, 24.406], [-0.12, 0.939, -0.321, 24.406], [0.0, 0.874, -0.486, 21.883], [0.0, 0.874, -0.486, 21.883], [0.0, 0.874, 0.486, -14.598], [0.0, 0.874, 0.486, -14.598]]

        import math

        for j in range(len(big_list)):
            for k in range(len(big_list)):
                if j != k:
                    # Euclidean distance between the two sub-lists
                    result = math.sqrt(sum((a - b)**2 for a, b in zip(big_list[j], big_list[k])))

Previously, I resolved the matter by setting a specific tolerance and appending each result to a new list, but I am trying to come up with a more efficient way of doing this. Eventually, big_list will probably contain over 1 million lists.

if result<=rel_tol and big_list[k] not in new_list:
    new_list.append(big_list[k])
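
Putting those pieces together, the full (slow) version looks roughly like this (the rel_tol value below is just a placeholder; the real tolerance depends on the data):

import math

rel_tol = 0.001  # placeholder tolerance, not a value from the actual problem
new_list = []

for j in range(len(big_list)):
    for k in range(len(big_list)):
        if j != k:
            # Euclidean distance between the two sub-lists
            result = math.sqrt(sum((a - b)**2 for a, b in zip(big_list[j], big_list[k])))
            if result <= rel_tol and big_list[k] not in new_list:
                new_list.append(big_list[k])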

Instead of doing:

for j in range(len(big_list)):
    for k in range(len(big_list)):

do this (note the j+1 ):

for j in range(len(big_list)):
    for k in range(j+1, len(big_list)):

This way your inner loop skips over all of the indexes you've already looked at, avoiding duplicate comparisons.
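
Applied to the distance computation from your question, that looks roughly like this (a sketch assuming rel_tol and new_list are defined as in your code):

import math

new_list = []
for j in range(len(big_list)):
    for k in range(j + 1, len(big_list)):   # k starts after j, so each pair is visited only once
        result = math.sqrt(sum((a - b)**2 for a, b in zip(big_list[j], big_list[k])))
        if result <= rel_tol and big_list[k] not in new_list:
            new_list.append(big_list[k])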

What @Justin replied was also my first thought, but on reflection I don't believe it is the most efficient approach for really big big_lists. Instead, I'd use a set() of tuples:

tupled_set = set([tuple(i) for i in big_list])
new_list_of_tuples = list(tupled_set)

Shorter and faster, using only builtin iterations. Of course, if you need to go back to lists (because for some reason you need mutable sub-lists), the performance gain may be lost (not sure, to be honest; it would need benchmarking), but the readability gain isn't:

list_of_lists = [list(i) for i in new_list_of_tuples]
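
A quick usage sketch (note this deduplicates on exact equality only, with no tolerance, and the original ordering is not preserved):

tupled_set = set(tuple(i) for i in big_list)
list_of_lists = [list(i) for i in tupled_set]
print(len(big_list), len(list_of_lists))   # original count vs. number of distinct sub-lists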

Cheers

Instead of comparing the lists against each other with two for loops you could convert them to tuples and use Counter to see how many instances there are. Then you could iterate over the list and pick the first occurrence of every sublist that has multiple instances.

from collections import Counter

c = Counter(tuple(l) for l in big_list)
new_list = []
for l in big_list:
    t = tuple(l)
    if c[t] > 1:
        new_list.append(l)
        c[t] = 0

This has O(n) time complexity, and it results in the same ordering as the original code.
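
A small usage sketch on toy data (not the data from the question) to show what ends up in new_list, namely one copy of each sub-list that occurs more than once, in original order:

from collections import Counter

big_list = [[1, 2], [3, 4], [1, 2], [5, 6], [3, 4]]

c = Counter(tuple(l) for l in big_list)
new_list = []
for l in big_list:
    t = tuple(l)
    if c[t] > 1:          # this sub-list occurs more than once
        new_list.append(l)
        c[t] = 0          # zero the count so later duplicates are skipped

print(new_list)   # [[1, 2], [3, 4]]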
