What is the most efficient way of selecting a random element from a long (and reasonably) sparse vector?

Question

I have a long, reasonably sparse boolean vector from which I want to repeatedly select random elements, and I am wondering what the most efficient way of doing so would be.

The vector can be up to around 100,000 elements long, and about 1 in every 20 elements will be "true" at any one time.

Selecting one of these elements will occasionally make other elements available for selection. So I can't just do a single initial pass over the boolean vector to collect the indices of all available elements, shuffle them, and pop elements one by one, because the list of available elements changes.

I have worked out a few ideas, but can't really tell which would be best, so any insight would be greatly appreciated.

method 1:

given input boolean vector A
create boolean vector B    // to store previously selected elements
create int vector C        // to store currently available element indices 
while stopping condition not met:
    clear C
    for each element a in A:
        if a is "true":
            append index of a to C
    generate random integer i between 0 and (length of C) - 1
    set element C[i] of A to "false"
    set element C[i] of B to "true"
    compute any new "true" values of A

method 2:

given input boolean vector A
create boolean vector B    // to store previously selected elements
create int vector C        // to store currently available element indices 
for each element a in A:
    if a is "true":
        append index of a to C
shuffle C
while stopping condition not met:
    pop element i from back of C
    set element i of A to "false"
    set element i of B to "true"
    compute any new "true" values of A
    if new values in A computed:
        append indices of new available elements to C
        shuffle C

Because not every selection from A changes the set of available elements, I think method 2 could be better than method 1, except that I am not sure how expensive it is to shuffle a long vector.
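On the cost of shuffling: a full shuffle is linear in the length of C, but it may not be needed after every append. A minimal C++ sketch, assuming C was uniformly shuffled beforehand, is to append the new index and swap it with a uniformly chosen position (one step of the "inside-out" Fisher-Yates shuffle), which keeps the order uniformly random in O(1). The function names `append_shuffled` and `pop_back_index` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: appends `value` to an already uniformly shuffled
// vector and restores uniformity with a single random swap.
template <class Rng>
void append_shuffled(std::vector<std::size_t>& c, std::size_t value, Rng& rng) {
    c.push_back(value);
    // Choose a slot in [0, size - 1]; the new element may stay where it is.
    std::uniform_int_distribution<std::size_t> dist(0, c.size() - 1);
    std::swap(c[dist(rng)], c.back());
}

// Hypothetical helper: removes and returns the last index. The result is a
// uniform pick as long as `c` is uniformly shuffled.
inline std::size_t pop_back_index(std::vector<std::size_t>& c) {
    assert(!c.empty());
    std::size_t i = c.back();
    c.pop_back();
    return i;
}
```

With this, the "shuffle C" step inside the loop of method 2 would be replaced by one call to `append_shuffled` per newly available index.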

method 3:

given input boolean vector A
create boolean vector B    // to store previously selected elements
while stopping condition not met:
    generate random integer i between 0 and (length of A) - 1
    if element i of A is "true":
        set element i of A to "false"
        set element i of B to "true"
        compute any new "true" values of A

This last approach seems naive, but if about 1 in every 20 elements is true (except near the end, when selections no longer make new elements available), then on average it would need only about 20 tries to find a selectable element. That could be less work than a full pass over the input vector or a shuffle of the available indices, especially if the vectors are long. Finding the last few elements would be very slow, but I could keep track of how many have been selected and switch to a different selection method once the number remaining drops below some threshold.
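The number of tries in method 3 follows a geometric distribution: if a fraction p of the elements is true, the expected number of draws is 1/p, so 20 for p = 1/20, and it grows as the available elements run out. A minimal C++ sketch of the rejection loop follows. The function name `pick_by_rejection`, the `max_tries` cutoff, and the use of `std::vector<char>` as the boolean vector are assumptions for illustration, not part of the question.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: draws random indices until it hits a "true" element.
// Returns the index found, or a.size() if max_tries draws all missed, so the
// caller can fall back to another selection method.
template <class Rng>
std::size_t pick_by_rejection(const std::vector<char>& a, Rng& rng,
                              std::size_t max_tries) {
    assert(!a.empty());
    std::uniform_int_distribution<std::size_t> dist(0, a.size() - 1);
    for (std::size_t t = 0; t < max_tries; ++t) {
        std::size_t i = dist(rng);
        if (a[i]) return i;  // accepted: element i is available
    }
    return a.size();  // sentinel: nothing found within the budget
}
```

The cutoff covers the case raised above where only a few elements remain: when the sentinel comes back, the caller could switch to a single linear pass over the vector.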

Does anyone have any idea as to which might be more efficient? The implementation will be in C++ if that makes any difference.

Thanks for your help

Answer 1

You can change the representation of your sparse vector to use two structures:

Primary vector (the vector you have right now)
True vector (a list of all "true" indices)

Your operations now become:

Insert:
    check element i in Primary Vector
    if false, set to true and add i to True Vector

Delete:
    check element i in Primary Vector
    if true, set to false and remove i from True Vector by swapping
    with last element and reducing size

(For this, each true entry in Primary Vector needs to store its position in True Vector, so that the entry to remove can be found without searching.)

Random:
    generate random index j between 0 and (size of True Vector) - 1
    return True Vector[j]

All of these operations run in O(1) time.
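A minimal C++ sketch of this two-vector representation, under a few assumptions: the class name `SparseBoolSet` and its member names are made up for illustration, and the Primary Vector is stored as a vector of positions into the True Vector (with a sentinel for "false") rather than as separate booleans and pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Sentinel position meaning "this index is currently false".
constexpr std::size_t kAbsent = static_cast<std::size_t>(-1);

// Hypothetical class: a set of indices in [0, n) with O(1) insert, erase,
// and uniform random sampling.
class SparseBoolSet {
public:
    explicit SparseBoolSet(std::size_t n) : pos_(n, kAbsent) {}

    bool contains(std::size_t i) const { return pos_[i] != kAbsent; }
    std::size_t size() const { return true_.size(); }
    bool empty() const { return true_.empty(); }

    // Insert: mark i as true; no effect if it already is.
    void insert(std::size_t i) {
        if (contains(i)) return;
        pos_[i] = true_.size();
        true_.push_back(i);
    }

    // Delete: move the last true index into i's slot, then shrink by one.
    void erase(std::size_t i) {
        if (!contains(i)) return;
        const std::size_t slot = pos_[i];
        const std::size_t last = true_.back();
        true_[slot] = last;
        pos_[last] = slot;
        true_.pop_back();
        pos_[i] = kAbsent;
    }

    // Random: return a uniformly chosen true index; the set must be non-empty.
    template <class Rng>
    std::size_t sample(Rng& rng) const {
        assert(!empty());
        std::uniform_int_distribution<std::size_t> dist(0, true_.size() - 1);
        return true_[dist(rng)];
    }

private:
    std::vector<std::size_t> pos_;   // Primary Vector: index -> slot in true_
    std::vector<std::size_t> true_;  // True Vector: all indices that are true
};
```

In the loop from the question, each iteration would call `sample`, then `erase` on the chosen index, then `insert` for any indices that become available. The extra memory is one `std::size_t` per element of the original vector.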

Answer 2

This sounds like a case for a van Emde Boas tree. With M the size of the key universe (here, the length of the vector):

Space   O(M)
Search  O(log log M)
Insert  O(log log M)
Delete  O(log log M)

Annotate the aux array with the number of members to make finding a random element easier.
