
Getting unique combinations from a non-unique list of items, FASTER?

First off, I am able to do it but I am not happy with the speed.

My question is: is there a better, faster way of doing this?

I have a list of items looking like this:

[(1,2), (1,2), (4,3), (7,8)]

And I need to get all the unique combinations. For example, the unique combinations of 2 items would be:

[(1,2), (1,2)], [(1,2), (4,3)], [(1,2), (7,8)], [(4,3), (7,8)]

After using itertools.combinations I get a lot more than that because of duplicates. For example, I get every list containing (1,2) twice. If I create a set of these combinations I get the unique ones. The problem comes when the original list has 80 tuples and I want combinations with 6 items in them. Getting that set takes more than 30 seconds. If I can get that number down I would be very happy.

I am aware that the number of combinations is huge and that's why creating the set is time-consuming. But I am still hoping that there is a library that has optimized the process in some way, speeding it up a bit.

It might be important to note that of all the combinations I find, I test only the first 10000 or so, because in some cases the full set of combos is way too much to process and there are other tests to be done too, so I don't really want to spend too much time on them.

This is a sample of what I have now:

from itertools import combinations

ls = [list of random NON-unique tuples (x, y)]
# ls = [(1,2), (1,2), (4,3), (7,8)]  # example
# the second code snippet below shows how I generate ls for testing

all_combos = combinations(ls, 6)
all_combos_set = set(all_combos)

for combo in all_combos_set:
  do_some_test_on(combo)
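Since only the first ~10000 combinations are actually tested, one option is to cap the iteration lazily with itertools.islice instead of building the full set first. A minimal sketch (duplicates would still need to be filtered while iterating; the counter stands in for do_some_test_on):

```python
from itertools import combinations, islice

ls = [(1, 2), (1, 2), (4, 3), (7, 8)]

# islice caps the work at the first 10000 candidates without
# materializing every combination in memory first.
tested = 0
for combo in islice(combinations(ls, 2), 10000):
    tested += 1  # stand-in for do_some_test_on(combo)

print(tested)  # 6 candidates here: C(4, 2)
```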

In case you want to test it out .. here is what I use for testing the speed of different methods:

import time
from itertools import combinations
from random import gauss, randint

def main3():
    tries = 4
    elements_in_combo = 6
    rng = 90
    data = [0]*rng
    for tr in range(tries):
        for n in range(1, rng):
            quantity = 0
            name = (0,0)
            ls = []
            for i in range(n):
                if quantity == 0:
                    quantity = int(abs(gauss(0, 4)))
                    if quantity != 0:
                        quantity -= 1
                    name = (randint(1000,7000), randint(1000,7000))
                    ls.append(name)
                else:
                    quantity -= 1
                    ls.append(name)

            start_time = time.time()
            all_combos = combinations(ls, elements_in_combo)
            all_combos = set(all_combos)

            duration = time.time() - start_time
            data[n] += duration
            print(n, "random files take", duration, "seconds.")

            if duration > 30:
                break

    for i in range(rng):
        print("average duration for", i, "is", (data[i]/tries), "seconds.")

The originally asked question, "is there a better, faster way of doing this?", actually contains two questions:

  • Is there a faster way?
  • Is there a better way?

I would like to narrow the answer to the question "Is there a faster way?" to:

Is there a FASTER way of removing duplicates from a list than doing it as follows:

lstWithUniqueElements = list(set(lstWithDuplicates))

?

To my knowledge, there is no faster way ...

Now let's concentrate on the second part of the question ("Is there a better way?"). Answering this kind of question is usually hard and needs much discussion, but that won't be the case here, because what a better way is was already clearly stated by the author of the question (citation):

I'd love to use a generator function. The itertools combinations() itself is an iterable and not a list or set, so if I figure out how to yield the unique combinations that'd be great.

So HERE it is:

def uniqueCombinations(lstList, comboSize): 
    from itertools import combinations
    lstList.sort()
    allCombos = combinations(lstList, comboSize)
    setUniqueCombos = set()
    for comboCandidate in allCombos:
        if comboCandidate in setUniqueCombos:
            continue
        yield comboCandidate
        setUniqueCombos.add(comboCandidate)

That's it ...
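A quick sanity check of the generator on the tiny list from the question (the function is repeated here so the snippet runs on its own):

```python
from itertools import combinations

def uniqueCombinations(lstList, comboSize):
    # repeated from above so this demo is self-contained
    lstList.sort()
    setUniqueCombos = set()
    for comboCandidate in combinations(lstList, comboSize):
        if comboCandidate in setUniqueCombos:
            continue
        yield comboCandidate
        setUniqueCombos.add(comboCandidate)

combos = list(uniqueCombinations([(1, 2), (1, 2), (4, 3), (7, 8)], 2))
print(combos)
# [((1, 2), (1, 2)), ((1, 2), (4, 3)), ((1, 2), (7, 8)), ((4, 3), (7, 8))]
```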


One more thing may be worth mentioning here. The author's chosen method of getting unique combinations (building a set of them) does not work in some special cases when the source list contains not only unique but also multiple elements with the same value, like this one:

    set(combinations(['a','a','b','a'], 2))          gives: {('a', 'b'), ('b', 'a'), ('a', 'a')}
    set(uniqueCombinations(['a','a','b','a'], 2))    gives: {('a', 'b'), ('a', 'a')}

In between, there is a pure Python function available here on Stack Overflow which is both faster and slower than the one provided above. How can it be faster AND slower? See HERE for details.

I guess this answer comes well after the OP needs it, but I ran into the same issue and would like to contribute my solution. I wanted to avoid storing any combinations in memory, because it is easy to see how that can go wrong.

First, this link provides a very clear explanation about how to calculate the number of distinct combinations when elements are repeated. The strategy is to create combinations with replacement and then discard the combinations that are not allowed.

For example, if the collection is (A, A, B, B) and you want all combinations of 3 elements, the combinations (A, A, A) and (B, B, B) are not allowed. The idea, therefore, is to create all possible combinations with replacement from the unique elements of the original collection and then discard the invalid ones. This keeps no lookup table in memory and is easy to write.
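For the (A, A, B, B) example this strategy can be spelled out directly with itertools; a sketch of the idea only, ahead of the polished version below:

```python
from itertools import combinations_with_replacement
from collections import Counter

available = Counter(['A', 'A', 'B', 'B'])

# Build candidates from the unique values, then discard any candidate
# that uses an element more often than the collection provides.
valid = [c for c in combinations_with_replacement(sorted(available), 3)
         if all(n <= available[x] for x, n in Counter(c).items())]
print(valid)  # ('A', 'A', 'A') and ('B', 'B', 'B') are discarded
```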

However, this strategy is wasteful when the collection has many unique elements. Taking this issue to its extreme, the only 3-element combination from the set (A, B, C) is clearly (A, B, C), but this strategy would also produce (A, A, A), (A, A, B), ... To alleviate this, notice that a unique element can only appear once in a valid combination: for the unique elements, the standard itertools.combinations() will do.

Therefore, if we have a mixture of unique and repeated elements, the final combinations can be split into a part generated from the unique elements with itertools.combinations() and a part generated from the repeated elements with itertools.combinations_with_replacement().

All in all, this is the code. How fast it runs really depends on the amount of repetition in the original collection. The worst case scenario is the one without repetition:

import itertools
from collections import Counter

# Check if an element is repeated more times than the original collection allows.
def comb_check(comb, original_dic):
    if not comb:
        return True
    comb_unique = set(comb)
    if len(comb_unique) < len(comb):
        counts = Counter(comb)
        for k in comb_unique:
            if counts[k] > 1 and original_dic[k] < counts[k]:
                return False
    return True

def generate_comb(elements, k):
    elements = Counter(elements)
    elements_unique = [e for e, v in elements.items() if v == 1]
    elements_other = [e for e in elements if e not in elements_unique]
    max_repetition = sum(elements[e] for e in elements_other)
    for n in range(0, min(k + 1, len(elements_unique) + 1)):
        # Only descend when the repeated elements can fill the remaining slots.
        if (n + max_repetition) >= k:
            for i in itertools.combinations(elements_unique, n):
                for j in itertools.combinations_with_replacement(elements_other, k - n):
                    if comb_check(j, elements):
                        # Emit the unique part joined with the repeated part.
                        yield i + j

#All unique elements is the worst case when it comes to time
lst = [a for a in range(80)]
for k in generate_comb(lst, 6):
    pass
#It took my machine ~ 264 sec to run this

#Slightly better
lst = [a for a in range(40)] + [a for a in range(40)]
for k in generate_comb(lst, 6):
    pass
#It took my machine ~ 32 sec to run this
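As a correctness check, the generator can be compared against the brute-force set-of-combinations approach on a small input. Both functions are restated compactly here so this snippet runs standalone:

```python
import itertools
from collections import Counter

# Compact restatement of the approach above: split into unique and
# repeated elements, combine combinations() with
# combinations_with_replacement(), and discard over-used repetitions.
def comb_check(comb, counts):
    return all(counts[k] >= v for k, v in Counter(comb).items())

def generate_comb(elements, k):
    counts = Counter(elements)
    uniq = [e for e, v in counts.items() if v == 1]
    rep = [e for e, v in counts.items() if v > 1]
    max_rep = sum(counts[e] for e in rep)
    for n in range(0, min(k, len(uniq)) + 1):
        if n + max_rep >= k:
            for i in itertools.combinations(uniq, n):
                for j in itertools.combinations_with_replacement(rep, k - n):
                    if comb_check(j, counts):
                        yield i + j

# Brute force for comparison: sort, enumerate, deduplicate.
ls = [(1, 2), (1, 2), (4, 3), (7, 8), (4, 3)]
expected = {tuple(sorted(c)) for c in itertools.combinations(sorted(ls), 3)}
got = {tuple(sorted(c)) for c in generate_comb(ls, 3)}
print(got == expected)  # both enumerate the same unique multisets
```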
