Imagine a CSV file with 3 columns: individual name, group name, group ID. Column 1 is different on every line, while columns 2 and 3 can repeat values from earlier lines (each group name has its own unique ID, though). The file is not sorted in any way.
For reasons of my own, I'm creating a dict that maps group ID (key) --> group name (value). Which of the following variants is faster?
Checking whether the key already exists and only saving if it does not:
if ID not in group_dict: group_dict[ID] = name
Just saving it every time (replacing the value, which is the same anyway):
group_dict[ID] = name
It's really best to profile the code when you have a question like this. Python provides the timeit module, which is useful for this purpose. Here is some code you can experiment with:
import timeit

setup_code = """
import random
keysize = 20
valsize = 32
store = dict()
data = [(random.randint(0, 2**keysize), random.randint(0, 2**valsize)) for _ in range(1000000)]
"""

query = """
for key, val in data:
    if key not in store:
        store[key] = val
"""

no_query = """
for key, val in data:
    store[key] = val
"""

if __name__ == "__main__":
    print(timeit.timeit(stmt=query, setup=setup_code, number=1))
    print(timeit.timeit(stmt=no_query, setup=setup_code, number=1))
The performance of the code depends on the number of key collisions, meaning how often a key is already present in the dict when you try to store it. In this code, if you increase keysize you will have fewer collisions, and checking the dict first will be slower. Conversely, if you reduce keysize, the number of collisions will increase and checking the dict first starts to perform better. The takeaway is that the number of collisions in your data determines which of these approaches is preferable.
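To relate this back to the CSV case in the question, here is a minimal sketch, not a definitive implementation. The sample rows and the helper names (build_with_check, build_without_check, duplicate_fraction) are made up for illustration. It shows that both variants produce the same dict as long as each group ID always maps to the same name, and it estimates how often a group ID repeats, which is the quantity the timing depends on.

```python
import csv
import io

# Hypothetical sample data: individual name, group name, group ID (unsorted).
SAMPLE_CSV = """alice,admins,10
bob,users,20
carol,admins,10
dave,guests,30
erin,users,20
"""


def build_with_check(rows):
    """Store a group ID only the first time it is seen."""
    group_dict = {}
    for _individual, name, group_id in rows:
        if group_id not in group_dict:
            group_dict[group_id] = name
    return group_dict


def build_without_check(rows):
    """Overwrite the entry on every row."""
    group_dict = {}
    for _individual, name, group_id in rows:
        group_dict[group_id] = name
    return group_dict


def duplicate_fraction(rows):
    """Fraction of rows whose group ID already appeared on an earlier row."""
    ids = [group_id for _individual, _name, group_id in rows]
    return 1 - len(set(ids)) / len(ids)


rows = list(csv.reader(io.StringIO(SAMPLE_CSV)))
checked = build_with_check(rows)
unchecked = build_without_check(rows)
print(checked == unchecked)      # both variants agree on this data
print(duplicate_fraction(rows))  # share of rows that hit an existing key
```

If the duplicate fraction on your real file is high, that corresponds to the small-keysize case above; if it is low, it corresponds to the large-keysize case. Either way, timing both variants on your own data is what settles it.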