Unique string ID generator algorithm

Question

I'm reading an open-source project, aventrix/jnanoid, and I can't understand mask and step in the code:

public static String randomNanoId(final Random random, final char[] alphabet, final int size) {

    if (random == null) {
        throw new IllegalArgumentException("random cannot be null.");
    }

    if (alphabet == null) {
        throw new IllegalArgumentException("alphabet cannot be null.");
    }

    if (alphabet.length == 0 || alphabet.length >= 256) {
        throw new IllegalArgumentException("alphabet must contain between 1 and 255 symbols.");
    }

    if (size <= 0) {
        throw new IllegalArgumentException("size must be greater than zero.");
    }

    final int mask = (2 << (int) Math.floor(Math.log(alphabet.length - 1) / Math.log(2))) - 1;
    final int step = (int) Math.ceil(1.6 * mask * size / alphabet.length);

    final StringBuilder idBuilder = new StringBuilder();

    while (true) {

        final byte[] bytes = new byte[step];
        random.nextBytes(bytes);

        for (int i = 0; i < step; i++) {

            final int alphabetIndex = bytes[i] & mask;

            if (alphabetIndex < alphabet.length) {
                idBuilder.append(alphabet[alphabetIndex]);
                if (idBuilder.length() == size) {
                    return idBuilder.toString();
                }
            }

        }

    }

}

Answer 1

In the loop we try to pick a random member (letter) of the alphabet on each iteration. This can fail for a given iteration, because the index we compute may be greater than or equal to the length of the alphabet.

We pick the letter by filling an array with random bytes and then using just enough bits of each byte to be able to reach any letter in the alphabet. If the alphabet has two characters, one bit is enough, since it can have the value 0 or 1. If the alphabet has 9 characters we need four bits, since three bits can only represent 8 values (0-7). That is what mask is: a run of low-order 1 bits (binary 1111, or 15, for the 9-character alphabet). bytes[i] & mask keeps the bottom four bits of the random byte, and those bits are used as the index into the alphabet. So if the random byte is 11000110, we keep the bottom four bits (0110), which is 6 in decimal, and we pick the letter at index 6 of the character array.

Now you can see how it can fail. If the random byte is 01001101, the masked value is 1101, which is 13 in decimal and beyond the end of the 9-letter alphabet. The mask calculation guarantees that the mask is wide enough to cover the entire alphabet, but it cannot stop the mask from covering more values than the alphabet has.
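The two byte examples above can be checked with a small sketch. The class name MaskDemo and the helpers maskFor and indexFor are made up for illustration and are not part of jnanoid; only the mask formula is copied from the question's code.

```java
class MaskDemo {

    // Same expression as in the question's code: yields a run of low-order
    // 1 bits just wide enough to represent (alphabetLength - 1).
    static int maskFor(int alphabetLength) {
        return (2 << (int) Math.floor(Math.log(alphabetLength - 1) / Math.log(2))) - 1;
    }

    // Hypothetical helper: the alphabet index selected by one random byte,
    // or -1 when the masked value falls past the end of the alphabet.
    static int indexFor(byte randomByte, int mask, int alphabetLength) {
        int candidate = randomByte & mask; // keep only the low bits
        return candidate < alphabetLength ? candidate : -1;
    }

    public static void main(String[] args) {
        int mask = maskFor(9);                                     // 9-letter alphabet
        System.out.println(mask);                                  // 15, binary 1111
        System.out.println(indexFor((byte) 0b11000110, mask, 9));  // 6, accepted
        System.out.println(indexFor((byte) 0b01001101, mask, 9));  // -1, because 13 >= 9
    }
}
```

The -1 return value is only a convenience for this sketch; the original loop simply skips the byte and moves on to the next one.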

I have to say step looks a bit arbitrary to me. If you look at the loop, step is the number of random bytes generated per pass in the attempt to build a string of length size. We've seen that an individual byte can fail to yield a letter, so step needs to be bigger than size. How much bigger depends on how large mask is relative to the alphabet length, so multiplying by mask / alphabet.length makes sense. Even that may not produce enough letters, so the result is bumped up by what looks like an arbitrary factor of 1.6. Of course it can STILL fail (every random byte in a batch could map to an index beyond the end of the alphabet), which is why the while (true) is there: if a batch does not yield enough letters, we draw another batch and continue.
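To put rough numbers on this, here is a sketch that computes step for the 9-letter example and compares it with the number of letters one batch is expected to yield. StepDemo, stepFor and acceptanceRate are illustrative names, not jnanoid API. The acceptance rate is my own reading of the loop, assuming the random bytes are uniform: the masked value is then uniform over mask + 1 possibilities, of which alphabet.length are accepted.

```java
class StepDemo {

    // Same expression as in the question's code.
    static int stepFor(int mask, int size, int alphabetLength) {
        return (int) Math.ceil(1.6 * mask * size / alphabetLength);
    }

    // Assumption: bytes are uniform, so each masked value in 0..mask is
    // equally likely and only those below alphabetLength are kept.
    static double acceptanceRate(int mask, int alphabetLength) {
        return (double) alphabetLength / (mask + 1);
    }

    public static void main(String[] args) {
        int mask = 15;          // mask for a 9-letter alphabet
        int alphabetLength = 9;
        int size = 10;          // desired ID length

        int step = stepFor(mask, size, alphabetLength);
        double rate = acceptanceRate(mask, alphabetLength);

        System.out.println(step);        // bytes drawn per batch
        System.out.println(rate);        // chance that one byte yields a letter
        System.out.println(step * rate); // expected letters per batch
    }
}
```

With these inputs a batch is 27 bytes and each byte is accepted with probability 0.5625, so one batch yields about 15 letters on average against the 10 required. That margin is why a second pass through the while loop should be uncommon, though still possible.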

1 answer

solution1
0 ACCEPTED 2019-09-18 16:09:40
