
What is the fastest key type for dictionaries in Python? tuple, frozenset…?

Context: I am trying to speed up the execution of k-means. For that, I pre-compute the means before the k-means run. These means are stored in a dictionary called means_dict, whose keys are the point ids sorted in ascending order and joined by underscores, and whose values are the means of those points. When I want to look up the mean of a given set of points in means_dict during the k-means execution, I have to regenerate that set's key, i.e. sort the point ids in ascending order and join them with underscores. This key-generation step takes a long time because a key may contain thousands of integers.

Each key in the dictionary is a sequence of integers separated by underscores. I have to sort the integers before joining them so that the key is unique for a given set of points; I finally obtain a string key. The problem is that this process is very slow. I want to use another key type that avoids sorting the sequence, and that key type should be faster than a string in terms of access, comparison and search.

    # means_dict: keys are strings built from the sorted point ids joined
    # by underscores, e.g. key = "3_45_76_78_344"; values are the means.
    # points: a dictionary mapping each k to a list of integer point ids.
    for k in keys:
        # this joining instruction is what takes so long
        key = "_".join(str(c) for c in sorted(points[k]))
        if key in means_dict:
            newmu.append(means_dict[key])
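
One order-insensitive key type worth trying (a minimal sketch, not from the original post): a frozenset is hashable, so it can be used as a dict key directly, and it compares equal regardless of element order, which removes both the sort and the string building. The caveat is that a set discards duplicate ids, so this only works if each point set contains distinct ids. The data below is a toy stand-in for the question's dicts:

    # Toy stand-ins; names mirror the question's dictionaries.
    points = {0: [76, 3, 344, 45, 78], 1: [9, 2]}
    means_dict = {frozenset([3, 45, 76, 78, 344]): 12.5}  # hypothetical precomputed mean

    newmu = []
    for k in points:
        key = frozenset(points[k])      # O(n) build: no sort, no str(), no join
        if key in means_dict:
            newmu.append(means_dict[key])
    print(newmu)                        # [12.5]

A tuple(sorted(...)) key would also work and preserves duplicates, but it still pays for the sort; the frozenset trades that away at the cost of set semantics.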

Computing the means is cheap.

Did you profile your program? How much of the time is spent recomputing the means? With proper numpy arrays instead of lists of boxed Python objects, this should be extremely cheap - definitely cheaper than constructing any such key!
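
For scale, here is roughly what "extremely cheap" looks like with NumPy; the array names and sizes are illustrative, not from the question:

    import numpy as np

    data = np.random.rand(100000, 2)        # one row of coordinates per point
    ids = np.array([3, 45, 76, 78, 344])    # ids of the points to average
    mu = data[ids].mean(axis=0)             # fancy indexing + one vectorized mean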

The reason why computing the key is expensive is simple: it means constructing an object of varying size. And based on your description, you will first be building a list of boxed integers, then a sorted list of boxed integers, then serializing each of them into a string, and then copying the strings again to join them with underscores. There is no way this is going to be faster than the simple - vectorizable - aggregation needed to compute the actual mean...
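
A quick way to check this claim on your own machine (the sizes are made up for illustration):

    import timeit

    setup = (
        "import numpy as np\n"
        "ids = np.random.permutation(100000)[:5000].tolist()\n"
        "arr = np.array(ids)\n"
        "data = np.random.rand(100000, 2)\n"
    )
    # Building the string key vs. just computing the mean of the same points
    t_key = timeit.timeit("'_'.join(str(c) for c in sorted(ids))", setup=setup, number=100)
    t_mu = timeit.timeit("data[arr].mean(axis=0)", setup=setup, number=100)
    print(f"string key: {t_key:.3f}s   numpy mean: {t_mu:.3f}s")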

You could even use MacQueen's approach to update the means rather than recomputing them. But even that is often slower than recomputing them.
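
For reference, the MacQueen-style update mentioned above boils down to a running mean; a minimal sketch (not the answerer's code):

    import numpy as np

    def add_point(mean, count, x):
        # Shift the centroid toward x instead of recomputing the whole mean.
        count += 1
        return mean + (x - mean) / count, count

    mean, count = np.zeros(2), 0
    for x in np.random.rand(5, 2):
        mean, count = add_point(mean, count, x)
    # 'mean' now equals the batch mean of the five points.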

I wouldn't be surprised if your approach ends up being 10x slower than regular k-means... and probably 1000x slower than clever k-means algorithms such as Hartigan and Wong's.
