简体   繁体   中英

How does set intersection with '&' work under the hood?

set_1 = {1, 2, 3, 4, 5, 6, 7}
set_2 = {2, 4, 6, 8, 10, 12}

intersec = set_1 & set_2

What happens under the hood when & is used between two sets? What is it's time complexity?

The set data structure, and its operations, are implemented in C. The source code can be found on GitHub, in the file Objects/setobject.c . For the benefit of those who don't write C, here is a rough translation of how it works in Python:

# helper function; only self needs to be a set
def set_intersection(self, other):
    if self is other:
        return self.copy()
    
    result = set()
    
    if isinstance(other, (set, frozenset)):
        if len(other) > len(self):
            self, other = other, self
    
    for item in other:
        if item in self:
            result.add(item)
    return result

# when using the & operator; self and other both must be sets
def set_and(self, other):
    if not isinstance(self, (set, frozenset)) or not isinstance(other, (set, frozenset)):
        return NotImplemented
    return set_intersection(self, other)

# when using the .intersection method, which accepts multiple "other" arguments
def set_intersection_multi(self, *others):
     if not others:
         return self.copy()
     
     result = self
     for other in others:
         result = set_intersection(result, other)
     return result

This isn't a literal translation; the C code has one loop for the case where other is a set, and another loop when it isn't, because in the former case the item hashes are already known and don't need to be recomputed. There's no way to translate that detail, so both loops in C are translatable to the same loop in Python, and I simplified it. Another untranslatable difference is that in C, if self is a frozenset then the result will be too; but Python code can't use .add in that case.

Because testing membership (the in operator) of a set takes O(1) expected time, the expected time complexity of the intersection is O(min(m, n)) when both are sets, where m and n are the lengths of self and other .

When other is not a set, the expected time complexity of the intersection is equal to the time complexity of iterating over other , plus the time complexity of computing the hashes of all the items in other . This will typically be O(n), but theoretically could be higher.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM