
Why is this O(n) three-way set disjointness algorithm slower than the O(n^3) version?

This should be O(n), because converting a list to a set is O(n), computing the intersection is O(n), and taking len is O(1):

def disjoint3c(A, B, C):
    """Return True if there is no element common to all three lists."""
    return len(set(A) & set(B) & set(C)) == 0

Or similarly, this should clearly be O(n):

def set_disjoint_medium(a, b, c):
    a, b, c = set(a), set(b), set(c)
    for elem in a:
        if elem in b and elem in c:
            return False
    return True

yet this O(n^3) code:

def set_disjoint_slowest(a, b, c):
    for e1 in a:
        for e2 in b:
            for e3 in c:
                if e1 == e2 == e3:
                    return False
    return True

runs faster.

See the timings below, where algorithm one is the O(n^3) version and algorithm three is the O(n) set code. Algorithm two is actually O(n^2): it optimizes algorithm one by checking the first two loops for a match before the third loop starts (a rough sketch of what I mean is shown just below).
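Algorithm two itself is not listed above; a minimal sketch of that optimization (the function name here is my own) would be:

def set_disjoint_n2(a, b, c):
    for e1 in a:
        for e2 in b:
            if e1 == e2:          # only scan c when a and b already agree
                for e3 in c:
                    if e1 == e3:
                        return False
    return True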

Size Input (n): 10000
Algorithm One:   0.014993906021118164
Algorithm Two:   0.013481855392456055
Algorithm Three: 0.01955580711364746

Size Input (n): 100000
Algorithm One:   0.15916991233825684
Algorithm Two:   0.1279449462890625
Algorithm Three: 0.18677806854248047

Size Input (n): 1000000
Algorithm One:   1.581618070602417
Algorithm Two:   1.146049976348877
Algorithm Three: 1.8179030418395996

The comments have already clarified the Big-O notation, so I will start by testing the code.

Here is the setup I used for testing the speed of the code.

import random

# Definitions collapsed because they are already shown in the question.
def disjoint3c(A, B, C): pass
def set_disjoint_medium(a, b, c): pass
def set_disjoint_slowest(a, b, c): pass

# Note: xrange means this was timed under Python 2; use range on Python 3.
a = [random.randrange(100) for i in xrange(10000)]
b = [random.randrange(100) for i in xrange(10000)]
c = [random.randrange(100) for i in xrange(10000)]

# Results with the timeit module:
1-) 0.00635750419422
2-) 0.0061145967287
3-) 0.0487953200969

Now to the results: as you can see, the O(n^3) solution runs about 8 times slower than the other solutions. That is still fast for such an algorithm (and even faster in your test). Why does this happen?

Because the medium and slowest solutions you used finish execution as soon as a common element is detected, the full complexity of the code is never realized: they return as soon as they find an answer. Why did the slowest solution run almost as fast as the other ones in your test? Probably because it finds the answer close to the beginning of the lists.

To test this, you could create the lists like this (try it yourself):

a = range(1000)
b = range(1000, 2000)
c = range(2000, 3000)

Now the real difference between the times will be obvious, because there is no common element and the slowest solution has to run through all of its iterations.
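For example, a minimal timeit harness for this worst case (Python 3 syntax; the function names are the ones defined in the question) could look like this:

import timeit

# Worst case: the three lists share no element, so nothing can exit early.
a, b, c = list(range(1000)), list(range(1000, 2000)), list(range(2000, 3000))

for fn in (set_disjoint_slowest, set_disjoint_medium, disjoint3c):
    print(fn.__name__, timeit.timeit(lambda: fn(a, b, c), number=10))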

So it is a matter of worst-case versus best-case performance.

Edit (not part of the question): So, what if you want to retain the speed of finding early common occurrences, but also don't want to increase the complexity? I made a rough solution for that; maybe more experienced users can suggest faster code.

import itertools

def mysol(a, b, c):
    store = [set(), set(), set()]

    # izip_longest on Python 2; use itertools.zip_longest on Python 3.
    for i, j, k in itertools.izip_longest(a, b, c):
        # Test against None (not truthiness) so that 0 is not skipped;
        # izip_longest pads the shorter lists with None.
        if i is not None: store[0].add(i)
        if j is not None: store[1].add(j)
        if k is not None: store[2].add(k)

        # An element is common to all three inputs if it also appears
        # in the stores of the other two lists.
        if (i in store[1] and i in store[2]) or \
           (j in store[0] and j in store[2]) or \
           (k in store[0] and k in store[1]):
            return False
    return True

What this code basically does is avoid converting all the lists to sets at the beginning. Instead, it iterates through all three lists at the same time, adds elements to the sets, and checks for common occurrences. So you keep the speed of finding an early solution, but it is still slow for the worst case shown above.
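As a quick sanity check of the rough solution above (the input values here are arbitrary examples):

print(mysol([1, 2, 3], [4, 2, 6], [7, 2, 9]))  # False: 2 appears in all three lists
print(mysol([1, 2], [3, 4], [5, 6]))           # True: no element is shared by all three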

As for speed, this runs 3-4 times slower than your first two solutions in the worst case, but 4-10 times faster than them on randomized lists.

Note: The fact that the first solution finds all common elements of the three lists means that, in theory, there is a faster solution: you only need to know whether there is even a single common element, and that knowledge is enough.

Big-O notation ignores all constant factors, so strictly speaking it only describes behavior as the input size grows without bound. For any finite input, it is only a rule of thumb.

With interpreted languages such as Python and R, constant factors can be pretty large: they need to create and collect many objects, which is all O(1) but not free. So it is, unfortunately, fairly common to see 100x performance differences between virtually equivalent pieces of code.

Secondly, the first algorithm computes all common elements, while the others stop at the first one. If you benchmark algX(a, a, a) (yes, all three sets being the same), it will do much more work than the others!
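To see that effect, a small sketch along the same lines (all three arguments identical; assumes the functions defined in the question):

import timeit

a = list(range(10000))

# With identical inputs, disjoint3c still builds three sets and computes the
# full intersection; set_disjoint_medium converts to sets but exits on the
# first element; set_disjoint_slowest returns on its very first comparison.
for fn in (disjoint3c, set_disjoint_medium, set_disjoint_slowest):
    print(fn.__name__, timeit.timeit(lambda: fn(a, a, a), number=10))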

I would not be surprised to see a sort-based O(n log n) algorithm be very competitive (because sorting is usually incredibly well optimized). For integers, I would use numpy arrays; by avoiding the Python interpreter as much as possible, you can get very fast. While numpy's in1d and intersect1d will likely give you an O(n^2) or O(n^3) algorithm, they may end up being faster in practice as long as your sets are usually disjoint.
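A rough sketch of that numpy idea (np.intersect1d sorts internally, so this is essentially the sort-based approach; unlike the loop versions it never exits early, but the work happens in compiled code):

import numpy as np

def disjoint3_numpy(a, b, c):
    # Sorted unique elements common to a and b, then test those against c.
    ab = np.intersect1d(np.asarray(a), np.asarray(b))
    return not np.in1d(ab, np.asarray(c)).any()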

Also note that in your case, the sets won't necessarily be pairwise disjoint: algX(set(), a, a) == True.
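A quick illustration of that point, using the functions from the question ("disjoint" here only means that no element is common to all three inputs):

a = [1, 2, 3]
print(disjoint3c([], a, a))           # True, even though the last two arguments are identical
print(set_disjoint_medium([], a, a))  # True as well: nothing is shared by all three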
