Given a python dict with multiple million entries, what is the most efficient way to get and remove a random (k,v) pair from it?
The dict is constantly growing and the random remove function is called very often.
The most cited solution for python2 random_key = random.choice(the_dict.keys())
is way too slow as a list of all the keys is created first. With lots of elements in the dict, this solution does not work.
Another proposed solution is the_dict.popitem()
, but this does not return a real random object, but depends on the internal ordering of the dict.
The third solution that is also way to slow is an iterator:
it = the_dict.iterkeys()
for i in range (random.randint(0, len(the_dict)-1)):
next(it)
random_key = next(it)
Next to the remove_random()
, sometimes a the_dict.pop(x)
is required for a specific key. Therefore, a simple list based secondary index does not work.
Can this problem be efficiently be solved with a dict?
A solution is to use a bidirectional map each key to an integer, to allow for randomly selecting a key by using random.randrange(0,N) to select from a range of ints which are bidirectionally mapped to the keys, where N is the number of keys.
Adding a new key simply assigns it the next higher int. Deleting a key reassigns the int for that key to the key that was assigned the previously highest int before deleting the key-value pair. Python code provided for clarity.
Python code:
def create(D): # O(len(D))
# Create the bidirectional maps from the dictionary, D
keys = D.keys()
ints = range(len(keys)
int_to_key = dict(zip(keys, ints))
key_to_int = dict(zip(ints, keys))
return (int_to_key, key_to_int)
def add(D, int_to_key, key_to_int, key, value): # O(1)
# Add key-value pair (no extra work needed for simply changing the value)
new_int = len(D)
D[key] = value
int_to_key[new_int] = key
key_to_int[key] = new_int
def remove(D, int_to_key, key_to_int, key): # O(1)
# Update the bidirectional maps then remove the key-value pair
# Get the two ints and keys.
key_int = key_to_int[key]
swap_int = len(D) - 1 # Should be the highest int
swap_key = int_to_key[swap_int]
# Update the bidirectional maps so that key now has the highest int
key_to_int[key], key_to_int[swap_key] = swap_int, key_int
int_to_key[key_int], int_to_key[swap_int] = swap_key, key
# Remove elements from dictionaries
D.remove(key)
key_to_int.remove(key)
int_to_key.remove(key)
def random_key(D, int_to_key): # O(1)
# Select a random key from the dictionary using the int_to_key map
return int_to_key[random.randrange(0, len(D))]
def remove_random(D, int_to_key, key_to_int): # O(1)
# Randomly remove a key from the dictionary via the bidirectional maps
key = random_key(D, int_to_key)
remove(D, int_to_key, key_to_int, key)
Note: Adding/removing keys from D without using the appropriate above functions will break the bidirectional map. This means it's best to implement this as a class.
No, as you've discovered, this can't be done efficiently with a plain dict. See this issue for some explanations about why implementing random.choice
for sets is hard; the same arguments apply to dictionaries.
But it's possible to create a dict-like data structure that does support efficient random selection. Here's a recipe for such an object, based in part on this question and its responses. It's only a starting-point, but it supports most of the existing dict methods, many of which are conveniently filled in by the MutableMapping
ABC . Depending on your needs, you may need to flesh it out a bit: for example, to be able to create a RandomChoiceDict
directly from a regular dict, or to add a meaningful __repr__
, etc.
Essentially, you need to maintain three structures: a list
of keys, a list
of corresponding values, and a dict
that maps keys back to indices (the inverse of the keys list). The basic __getitem__
, __setitem__
and __delitem__
operations can be simply implemented in terms of those structures, and if __len__
and __iter__
are specified, the abstract base class takes care of most of the rest.
from collections import MutableMapping
import random
class RandomChoiceDict(MutableMapping):
"""
Dictionary-like object allowing efficient random selection.
"""
def __init__(self):
# Add code to initialize from existing dictionaries.
self._keys = []
self._values = []
self._key_to_index = {}
def __getitem__(self, key):
return self._values[self._key_to_index[key]]
def __setitem__(self, key, value):
try:
index = self._key_to_index[key]
except KeyError:
# Key doesn't exist; add a new one.
index = len(self._keys)
self._key_to_index[key] = index
self._keys.append(key)
self._values.append(value)
else:
# Key already exists; overwrite the value.
self._values[index] = value
def __delitem__(self, key):
index = self._key_to_index.pop(key)
# Remove *last* indexed element, then put
# it back at position 'index' (overwriting the
# one we're actually removing) if necessary.
key, value = self._keys.pop(), self._values.pop()
if index != len(self._key_to_index):
self._keys[index] = key
self._values[index] = value
self._key_to_index[key] = index
def __len__(self):
return len(self._key_to_index)
def __iter__(self):
return iter(self._keys)
def random_key(self):
"""Return a randomly chosen key."""
if not self:
raise KeyError("Empty collection")
index = random.randrange(len(self))
return self._keys[index]
def popitem_random(self):
key = self.random_key()
value = self.pop(key)
return key, value
Example usage:
>>> d = RandomChoiceDict()
>>> for x in range(10**6): # populate with some values
... d[x] = x**2
...
>>> d.popitem_random() # remove and return random item
(132545, 17568177025)
>>> 132545 in d
False
>>> d.popitem_random()
(954424, 910925171776)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.