
What's the fastest way of finding a random index in a Python list, a large number of times?

What's the best (fastest) way of extracting a random value from a list, a large number (>1M) of times?

I am currently in a situation where I have a graph represented as an adjacency list, whose inner lists can have vastly different lengths (in the range [2, potentially 100k]).

I need to iterate over this list to generate random walks, so my current solution is along the lines of (see the sketch after the list):

  1. Get random node
  2. Pick a random index from that node's adjacency list
  3. Move to the new node
  4. Go to 2
  5. Repeat until the random walk is as long as needed
  6. Go to 1
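
A minimal sketch of that loop (the names adj, walk_len and random_walk are placeholders for illustration, not my actual code):

import random

def random_walk(adj, walk_len):
    # 1. get a random starting node
    node = random.randrange(len(adj))
    walk = [node]
    # 2-5. repeatedly move to a random neighbor until the walk is long enough
    for _ in range(walk_len - 1):
        neighbors = adj[node]
        node = neighbors[int(random.random() * len(neighbors))]
        walk.append(node)
    return walk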

This works well enough when the graph isn't too big; however, I am now working with a graph that contains >440k nodes, with a very large variance in the number of edges each node has.

The expression I am using at the moment to extract the random index is

node_neighbors[int(random.random() * number_neighbors_of_node)]

This sped up the computation over my previous implementation, but it's still unacceptably slow for my purposes.

The number of neighbors of a node can go from 2 to tens of thousands, I cannot remove small nodes, and I have to generate tens of thousands of random walks in this environment.

From profiling the code, most of the generation time is spent looking for these indexes, so I'm looking for a method that can reduce the time taken to do so. However, if it's possible to sidestep it entirely by modifying the algorithm, that would also be great.
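
For reference, profiling like this can be done with the standard-library cProfile module (generate_walks below is a hypothetical stand-in for the walk-generation function):

import cProfile

# generate_walks is a hypothetical placeholder for the walk-generation function
cProfile.run("generate_walks()", sort="cumulative")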

Thanks!

EDIT: out of curiosity, I tested four variants of the same code using timeit and these are the results:

setup='''
import numpy as np
import random

# generate a random adjacency list, nodes have a number of neighbors between 2 and 10000

l = [list(range(random.randint(2, 10000))) for _ in range(10000)]
'''

for _ in range(1000):    
    v = l[random.randint(0, 10000-1)] # Get a random node adj list 
    vv = v[random.randint(0, len(v)-1)] # Find a random neighbor in v

0.29709450000001425

for _ in range(1000):    
    v = l[random.randint(0, 10000-1)]
    vv = v[np.random.choice(v)]

26.760767499999986

for _ in range(1000):    
    v = l[random.randint(0, 10000-1)]
    vv = v[int(random.random()*(len(v)))]

0.19086300000000733

for _ in range(1000):    
    v = l[random.randint(0, 10000-1)]
    vv = v[int(random.choice(v))]

0.24351880000000392

Your solution (sol3) is already the fastest, by a larger margin than your tests suggest. I adapted the performance measurements to eliminate the arbitrary selection of nodes in favour of a path traversal that is much closer to your stated goal.

Here are the improved performance tests and results. I added sol5() to see if pre-computing a list of random values would make a difference (I was hoping numpy would vectorize that, but it didn't go any faster).

Setup

import numpy as np
import random

# generate a random adjacency list, nodes have a number of neighbors between 2 and 10000

nodes     = [list(range(random.randint(2, 10000))) for _ in range(10000)]
pathLen   = 1000

Solutions

def sol1():
    node = nodes[0]
    for _ in range(pathLen):
        node = nodes[random.randint(0, len(node)-1)] # move to a random neighbor

def sol2():
    node = nodes[0]
    for _ in range(pathLen):
        node = nodes[np.random.choice(node)]

def sol3():
    node = nodes[0]
    for _ in range(pathLen):
        node = nodes[int(random.random()*(len(node)))]

def sol4():
    node = nodes[0]
    for _ in range(pathLen):
        node = nodes[int(random.choice(node))]

def sol5():
    node = nodes[0]
    for rng in np.random.random_sample(pathLen):
        node = nodes[int(rng*len(node))]

Measurements

from timeit import timeit
count = 100

print("sol1",timeit(sol1,number=count))
print("sol2",timeit(sol2,number=count))
print("sol3",timeit(sol3,number=count))
print("sol4",timeit(sol4,number=count))
print("sol5",timeit(sol5,number=count))

sol1 0.12516996199999975
sol2 30.445685411
sol3 0.03886452900000137
sol4 0.1244026900000037
sol5 0.05330073100000021

numpy is not very good at manipulating arrays with variable row lengths (like your neighbor lists), but perhaps a way to accelerate the process is to vectorize the next-node selection. By assigning a random float to each node in a numpy array, you can use that to navigate between nodes until your path comes back to an already visited node. Only then would you need to generate a new random value for that node. Presumably, and depending on the path length, there would be a relatively small number of these "collisions".
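
A minimal sketch of that idea, reusing the nodes list from the benchmarks above (nodeRng and walk_cached_rng are names chosen only for this illustration):

import numpy as np
import random

nodes = [list(range(random.randint(2, 10000))) for _ in range(10000)]

# one pre-drawn random float per node, reused until a walk revisits that node
nodeRng = np.random.random_sample(len(nodes))

def walk_cached_rng(start, stepCount):
    node    = start
    visited = set()
    path    = [node]
    for _ in range(stepCount):
        if node in visited:
            # collision: draw a fresh value so the walk doesn't repeat its previous move
            nodeRng[node] = np.random.random_sample()
        visited.add(node)
        neighbors = nodes[node]
        node      = neighbors[int(nodeRng[node] * len(neighbors))]
        path.append(node)
    return path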

Using the same idea, and leveraging numpy's vectorizing, you can make several traversals in parallel by creating a matrix of node identifiers (columns) with each row being a parallel traversal.

To illustrate this, here's a function that advances multiple "ants" on their individual random paths through the nodes:

import numpy as np
import random

nodes   = [list(range(random.randint(2, 10000))) for _ in range(10000)]
nbLinks = np.array(list(map(len,nodes)),dtype=int)            # number of neighbors per node
npNodes = np.array([nb+[-1]*(10000-len(nb)) for nb in nodes]) # fixed sized rows for numpy

def moveAnts(antCount=12,stepCount=8,antPos=None,allPaths=False):
    if antPos is None:
        antPos = np.random.choice(len(nodes),antCount)
    paths = antPos[:,None]

    for _ in range(stepCount):
        nextIndex = np.random.random_sample(size=(antCount,))*nbLinks[antPos]
        antPos    = npNodes[antPos,nextIndex.astype(int)]
        if allPaths:
            paths = np.append(paths,antPos[:,None],axis=1)
        
    return paths if allPaths else antPos

Example: 12 ants advancing randomly by 8 steps from random starting locations

print(moveAnts(12,8,allPaths=True))

"""
    [[8840 1302 3438 4159 2983 2269 1284 5031 1760]
     [4390 5710 4981 3251 3235 2533 2771 6294 2940]
     [3610 2059 1118 4630 2333  552 1375 4656 6212]
     [9238 1295 7053  542 6914 2348 2481  718  949]
     [5308 2826 2622   17   78  976   13 1640  561]
     [5763 6079 1867 7748 7098 4884 2061  432 1827]
     [3196 3057   27  440 6545 3629  243 6319  427]
     [7694 1260 1621  956 1491 2258  676 3902  582]
     [1590 4720  772 1366 2112 3498 1279 5474 3474]
     [2587  872  333 1984 7263  168 3782  823    9]
     [8525  193  449  982 4521  449 3811 2891 3353]
     [6824 9221  964  389 4454  720 1898  806   58]]
"""

Performance is not better for an individual ant, but in parallel the time per ant is much better:

from timeit import timeit
count = 100

antCount  = 100
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)

t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)

print(t) # 0.9538277329999989 / 100 --> 0.009538277329999989 per ant

[EDIT] I thought of a better array model for the variable-sized rows and came up with an approach that does not waste memory in a (mostly empty) matrix of fixed dimensions. The approach is to use a 1D array to hold the links of all nodes consecutively, plus two additional arrays to hold the starting position and number of neighbors of each node. This data structure turns out to run even faster than the fixed-size 2D matrix.

import numpy as np
import random

nodes     = [list(range(random.randint(2, 10000))) for _ in range(10000)]
links     = np.array([n for neighbors in nodes for n in neighbors])  # all neighbor lists flattened into one array
linkCount = np.array(list(map(len,nodes)),dtype=int)                 # number of neighbors for each node
firstLink = np.insert(np.add.accumulate(linkCount),0,0) # index of first link for each node



def moveAnts(antCount=12,stepCount=8,antPos=None,allPaths=False):
    if antPos is None:
        antPos = np.random.choice(len(nodes),antCount)
    paths = antPos[:,None]

    for _ in range(stepCount):
        nextIndex = np.random.random_sample(size=(antCount,))*linkCount[antPos]
        antPos    = links[firstLink[antPos]+nextIndex.astype(int)]
        if allPaths:
            paths = np.append(paths,antPos[:,None],axis=1)
        
    return paths if allPaths else antPos

from timeit import timeit
count = 100

antCount  = 100
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)

t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)

print(t) # 0.7157810379999994 / 100 --> 0.007157810379999994 per ant

The performance "per ant" improves as you add more of them, up to a point (roughly 10x faster than sol3):

antCount  = 1000
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)

t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)

print(t,t/antCount) #3.9749405650000007, 0.0039749405650000005 per ant

antCount  = 10000
stepCount = 1000
ap = np.random.choice(len(nodes),antCount)

t = timeit(lambda:moveAnts(antCount,stepCount,ap),number=count)

print(t,t/antCount) #32.688697579, 0.0032688697579 per ant
