For sets S and T, why does Python's S -= T take O(len(T)) and not O(len(S))?

Question

The Set entries in this Python time complexity table say, and the comments below confirm, that S.difference_update(T) takes time O(len(T)) while S - T takes O(len(S)). The reason given is that the algorithm for the first is "for every element in T remove it from S", while the algorithm for the second is "for every element in S add it to the new set, if not in T".

Wouldn't the algorithm "for every element in S, remove it from S if it's in T" work identically well and be O(len(S))? Why not just do whichever is shorter?

I think I'm not seeing something.

Answer 1

Nothing in the operation requires S to be larger than T. T can easily be much larger than S:

>>> S = {1, 2, 3}
>>> T = {3, 4, 5, 6, 7, 8, 9}
>>> S - T
{1, 2}

So choosing one algorithm or the other for all operations would be an arbitrary choice, since you don't know in advance which set is shorter.

In general, though, it does not really matter. Both S and T are inputs, and both O(|T|) and O(|S|) are still O(n), i.e. linear, so it's not really an issue anyway.

I've checked with the source to verify further what exactly is happening.

S.difference(T), S - T (set_difference): This computes the difference between two set objects. It iterates over the elements of S and checks whether each one is in T. If it is not, the element is added to the result set.
If S is much larger than T, the implementation instead copies S and performs S' -= T on the copy. Since this leaves most items in place, it is cheaper than starting with an empty set and adding elements from S one by one.
S.difference_update(T) (set_difference_update): First of all, this accepts multiple arguments, so it cannot simply check T's length and swap, because there may be several Ts. Even more importantly, it supports Ts that are not sets themselves (any iterable), so it can only work by iterating through those iterables and removing their items from the set.
So here, iterating over S is not really an option, since the Ts do not guarantee a constant-time membership check.
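
To make the two strategies concrete, here is a rough pure-Python sketch of what was described above. The function names difference_sketch and difference_update_sketch are made up for illustration; the real code is C inside CPython and handles more cases (such as the copy-and-subtract shortcut), so treat this as an approximation only.

```python
def difference_sketch(s, t):
    """Return a new set with the items of s that are not in t.

    Loops over s, so the work grows with len(s).
    Assumes t supports fast membership tests (i.e. t is a set).
    """
    result = set()
    for item in s:
        if item not in t:
            result.add(item)
    return result


def difference_update_sketch(s, *others):
    """Remove from s, in place, every item produced by each iterable.

    Loops over the other iterables, so the work grows with their total
    length. The others can be lists, tuples, generators, and so on.
    """
    for other in others:
        for item in other:
            s.discard(item)  # discard() does nothing if item is absent


S = {1, 2, 3}
T = {3, 4, 5, 6, 7, 8, 9}
print(difference_sketch(S, T))   # {1, 2}

# Several arguments, none of which is a set:
difference_update_sketch(S, [3, 4], range(2, 3))
print(S)                         # {1}
```

Note that the second function never needs to ask "is this item in T?", which is why it works with any iterable.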

So as it turns out, there are reasons why it works this way. They are mostly found in the set methods rather than in the operator implementations (which use the methods internally). While you could micro-optimize a few special cases further, as noted above this won't gain you much, since you still stay at O(n). And in typical applications (especially in Python), such an operation is unlikely to be your bottleneck.
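
For what it's worth, the "just do whichever is shorter" idea from the question could look something like the following when both arguments are known to be sets. This is a hypothetical helper (difference_update_shorter is not part of Python), written only to show the special case; it assumes t is a real set, which is exactly the assumption difference_update cannot make.

```python
def difference_update_shorter(s, t):
    """Hypothetical in-place s -= t that loops over the smaller set.

    Assumes both s and t are sets, so len() and fast membership
    tests are available on each of them.
    """
    if len(s) <= len(t):
        # Iterate over a snapshot of s: a set must not change size
        # while it is being iterated over.
        for item in list(s):
            if item in t:
                s.remove(item)
    else:
        for item in t:
            s.discard(item)


S = {1, 2, 3}
T = {3, 4, 5, 6, 7, 8, 9}
difference_update_shorter(S, T)  # loops over S, the smaller set
print(S)                         # {1, 2}
```

Either branch is still linear in one of the inputs, so this only changes the constant factor for lopsided sizes.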

1 answer

solution1
3 ACCEPTED 2015-03-11 10:39:39