简体   繁体   中英

How to generate this custom alpha-numeric sequence?

I would like to create a program that generate a particular long 7 characters string.

It must follow this rules:

  1. 0-9 are before az which are before AZ

  2. Length is 7 characters.

  3. Each character must be different from the two close (Example 'NN' is not allowed)

  4. I need all the possible combination incrementing from 0000000 to ZZZZZZZ but not in a random sequence

I have already done it with this code:

from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product

chars = digits + ascii_lowercase + ascii_uppercase

for n in range(7, 8):
    for comb in product(chars, repeat=n):
        if (comb[6] != comb[5] and comb[5] != comb[4] and comb[4] != comb[3] and comb[3] != comb[2] and comb[2] != comb[1] and comb[1] != comb[0]):
            print ''.join(comb)

But it is not performant at all because i have to wait a long time before the next combination.

Can someone help me?

Edit : I've updated the solution to use cached short sequences for lengths greater than 4. This significantly speeds up the calculations. With the simple version, it'd take 18.5 hours to generate all sequences of length 7, but with the new method only 4.5 hours.

I'll let the docstring do all of the talking for describing the solution.

"""
Problem:
    Generate a string of N characters that only contains alphanumerical
    characters. The following restrictions apply:
        * 0-9 must come before a-z, which must come before A-Z
        * it's valid to not have any digits or letters in a sequence
        * no neighbouring characters can be the same
        * the sequences must be in an order as if the string is base62, e.g.,
          01010...01019, 0101a...0101z, 0101A...0101Z, 01020...etc

Solution:
    Implement a recursive approach which discards invalid trees. For example,
    for "---" start with "0--" and recurse. Try "00-", but discard it for
    "01-". The first and last sequences would then be "010" and "ZYZ".

    If the previous character in the sequence is a lowercase letter, such as
    in "02f-", shrink the pool of available characters to a-zA-Z. Similarly,
    for "9gB-", we should only be working with A-Z.

    The input also allows to define a specific sequence to start from. For
    example, for "abGH", each character will have access to a limited set of
    its pool. In this case, the last letter can iterate from H to Z, at which
    point it'll be free to iterate its whole character pool next time around.

    When specifying a starting sequence, if it doesn't have enough characters
    compared to `length`, it will be padded to the right with characters free
    to explore their character pool. For example, for length 4, the starting
    sequence "29" will be transformed to "29  ", where we will deal with two
    restricted characters temporarily.

    For long lengths the function internally calls a routine which relies on
    fewer recursions and cached results. Length 4 has been chosen as optimal
    in terms of precomputing time and memory demands. Briefly, the sequence is
    broken into a remainder and chunks of 4. For each preceeding valid
    subsequence, all valid following subsequences are fetched. For example, a
    sequence of six would be split into "--|----" and for "fB|----" all
    subsequences of 4 starting A, C, D, etc would be produced.

Examples:
    >>> for i, x in enumerate(generate_sequences(7)):
    ...    print i, x
    0, 0101010
    1, 0101012
    etc

    >>> for i, x in enumerate(generate_sequences(7, '012abcAB')):
    ...    print i, x
    0, 012abcAB
    1, 012abcAC
    etc

    >>> for i, x in enumerate(generate_sequences(7, 'aB')):
    ...    print i, x
    0, aBABABA
    1, aBABABC
    etc
"""

import string

ALLOWED_CHARS = (string.digits + string.ascii_letters,
                 string.ascii_letters,
                 string.ascii_uppercase,
                 )
CACHE_LEN = 4

def _generate_sequences(length, sequence, previous=''):
    char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
    if sequence[-length] != ' ':
        char_set = char_set[char_set.find(sequence[-length]):]
        sequence[-length] = ' '
    char_set = char_set.replace(previous, '')

    if length == 1:
        for char in char_set:
            yield char
    else:
        for char in char_set:
            for seq in _generate_sequences(length-1, sequence, char):
                yield char + seq

def _generate_sequences_cache(length, sequence, cache, previous=''):
    sublength = length if length == CACHE_LEN else min(CACHE_LEN, length-CACHE_LEN)
    subseq = cache[sublength != CACHE_LEN]
    char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
    if sequence[-length] != ' ':
        char_set = char_set[char_set.find(sequence[-length]):]
        index = len(sequence) - length
        subseq0 = ''.join(sequence[index:index+sublength]).strip()
        sequence[index:index+sublength] = [' '] * sublength
        if len(subseq0) > 1:
            subseq[char_set[0]] = tuple(
                    s for s in subseq[char_set[0]] if s.startswith(subseq0))
    char_set = char_set.replace(previous, '')

    if length == CACHE_LEN:
        for char in char_set:
            for seq in subseq[char]:
                yield seq
    else:
        for char in char_set:
            for seq1 in subseq[char]:
                for seq2 in _generate_sequences_cache(
                                length-sublength, sequence, cache, seq1[-1]):
                    yield seq1 + seq2

def precompute(length):
    char_set = ALLOWED_CHARS[0]
    if length > 1:
        sequence = [' '] * length
        result = {}
        for char in char_set:
            result[char] = tuple(char + seq for seq in  _generate_sequences(
                                                     length-1, sequence, char))
    else:
        result = {char: tuple(char) for char in ALLOWED_CHARS[0]}
    return result

def generate_sequences(length, sequence=''):
    # -------------------------------------------------------------------------
    # Error checking: consistency of the value/type of the arguments
    if not isinstance(length, int):
        msg = 'The sequence length must be an integer: {}'
        raise TypeError(msg.format(type(length)))
    if length < 0:
        msg = 'The sequence length must be greater or equal than 0: {}'
        raise ValueError(msg.format(length))
    if not isinstance(sequence, str):
        msg = 'The sequence must be a string: {}'
        raise TypeError(msg.format(type(sequence)))
    if len(sequence) > length:
        msg = 'The sequence has length greater than {}'
        raise ValueError(msg.format(length))
    # -------------------------------------------------------------------------
    if not length:
        yield ''
    else:
        # ---------------------------------------------------------------------
        # Error checking: the starting sequence, if provided, must be valid
        if any(s not in ALLOWED_CHARS[0]+' ' for s in sequence):
            msg = 'The sequence contains invalid characters: {}'
            raise ValueError(msg.format(sequence))
        if sequence.strip() != sequence.replace(' ', ''):
            msg = 'Uninitiated characters in the middle of the sequence: {}'
            raise ValueError(msg.format(sequence.strip()))
        sequence = sequence.strip()
        if any(a == b for a, b in zip(sequence[:-1], sequence[1:])):
            msg = 'No neighbours must be the same character: {}'
            raise ValueError(msg.format(sequence))
        char_type = [s.isalpha() * (2 - s.islower()) for s in sequence]
        if char_type != sorted(char_type):
            msg = '0-9 must come before a-z, which must come before A-Z: {}'
            raise ValueError(msg.format(sequence))
        # ---------------------------------------------------------------------
        sequence = list(sequence.ljust(length))
        if length <= CACHE_LEN:
            for s in _generate_sequences(length, sequence):
                yield s
        else:
            remainder = length % CACHE_LEN
            if not remainder:
                cache = tuple((precompute(CACHE_LEN),))
            else:
                cache = tuple((precompute(CACHE_LEN), precompute(remainder)))
            for s in _generate_sequences_cache(length, sequence, cache):
                yield s

I've included thorough error checks in the generate_sequences() function. For the sake of brevity you can remove them if you can guarantee that whoever calls the function will never do so with invalid input. Specifically, invalid starting sequences.

Counting number of sequences of specific length

While the function will sequentially generate the sequences, there is a simple combinatorics calcuation we can perform to compute how many valid sequences exist in total.

The sequences can effectively be broken down to 3 separate subsequences. Generally speaking, a sequence can contain anything from 0 to 7 digits, followed by from 0 to 7 lowercase letters, followed by from 0 to 7 uppercase letters. As long as the sum of those is 7. This means we can have the partition (1, 3, 3), or (2, 1, 3), or (6, 0, 1), etc. We can use the stars and bars to calculate the various combinations of splitting a sum of N into k bins. There is already an implementation for python , which we'll borrow. The first few partitions are:

[0, 0, 7]
[0, 1, 6]
[0, 2, 5]
[0, 3, 4]
[0, 4, 3]
[0, 5, 2]
[0, 6, 1]
...

Next, we need to calculate how many valid sequences we have within a partition. Since the digit subsequences are independent of the lowercase letters, which are independent of the uppercase letters, we can calculate them individually and multiply them together.

So, how many digit combinations we can have for a length of 4? The first character can be any of the 10 digits, but the second character has only 9 options (ten minus the one that the previous character is). Similarly for the third letter and so on. So the total number of valid subsequences is 10*9*9*9. Similarly, for length 3 for letters, we get 26*25*25. Overall, for the partition, say, (2, 3, 2), we have 10*9*26*25*25*26*25 = 950625000 combinations.

import itertools as it

def partitions(n, k):
    for c in it.combinations(xrange(n+k-1), k-1):
        yield [b-a-1 for a, b in zip((-1,)+c, c+(n+k-1,))]

def count_subsequences(pool, length):
    if length < 2:
        return pool**length
    return pool * (pool-1)**(length-1)

def count_sequences(length):
    counts = [[count_subsequences(i, j) for j in xrange(length+1)] \
              for i in [10, 26]]

    print 'Partition {:>18}'.format('Sequence count')

    total = 0
    for a, b, c in partitions(length, 3):
        subtotal = counts[0][a] * counts[1][b] * counts[1][c]
        total += subtotal
        print '{} {:18}'.format((a, b, c), subtotal)
    print '\nTOTAL {:22}'.format(total)

Overall, we observe that while generating the sequences fast isn't a problem, there are so many that it can take a long time. Length 7 has 78550354750 (78.5 billion) valid sequences and this number only scales approximately by a factor of 25 with each incremented length.

Extreme cases are not handled here but can be done this way

import random
from string import digits, ascii_uppercase, ascii_lowercase

len1 = random.randint(1, 7)
len2 = random.randint(1, 7-len1)
len3 = 7 - len1 - len2
print len1, len2, len3
result = ''.join(random.sample(digits, len1) + random.sample(ascii_lowercase, len2) + random.sample(ascii_uppercase, len3))

Try this

import string
import random

a = ''.join(random.choice(string.ascii_lowercase + string.ascii_uppercase + string.digits) for _ in range(7))
print(a)

If it's a random string you want that sticks to the above rules you can use something like this:

def f():
  digitLen = random.randrange(8)
  smallCharLen = random.randint(0, 7 - digitLen)
  capCharLen = 7 - (smallCharLen + digitLen)
  print (str(random.randint(0,10**digitLen-1)).zfill(digitLen) +
      "".join([random.choice(ascii_lowercase) for i in range(smallCharLen)]) +
      "".join([random.choice(ascii_uppercase) for i in range(capCharLen)]))

I haven't added the repeated character rule but one you have the string it's easy to filter out the unwanted strings using dictionaries. You can also fix the length of each segment by putting conditions on the segment lengths.

Edit: a minor bug.

The reason it takes a long time to generate the first result with the original implementation is it takes a long time to reach the first valid value of 0101010 when starting from 0000000 as you do when using product.

Here's a recursive version which generates valid sequences rather than discarding invalid ones:

from string import digits, ascii_uppercase, ascii_lowercase
from sys import argv
from itertools import combinations_with_replacement, product

all_chars=[digits, ascii_lowercase, ascii_uppercase]

def seq(char_sets, start=None):
    for char_set in char_sets:
        for val in seqperm(char_set, start):
            yield val

def seqperm(char_set, start=None, exclude=None):
    left_chars, remaining_chars=char_set[0], char_set[1:]
    if start:
        try:
            left_chars=left_chars[left_chars.index(start[0]):]
            start=start[1:]
        except:
            left_chars=''
    for left in left_chars:
        if left != exclude:
            if len(remaining_chars) > 0:
                for right in seqperm(remaining_chars, start, left):
                    yield left + right
            else:
                yield left

if __name__ == "__main__":
    count=int(argv[1])
    start=None
    if len(argv) == 3:
        start=argv[2]
    # char_sets=list(combinations_with_replacement(all_chars, 7))
    char_sets=[[''.join(all_chars)] * 7]
    for idx, val in enumerate(seq(char_sets, start)):
        if idx == count:
            break
        print idx, val

Run as follows:

./permute.py 10 

Output:

0 0101010
1 0101012
2 0101013
3 0101014
4 0101015
5 0101016
6 0101017
7 0101018
8 0101019
9 010101a

If you pass an additional argument then the script skips to the portion of the sequence which starts with that third argument like this:

./permute.py 10 01234Z

If it's a requirement to generate only permutations where lower letters always follow numbers and upper case always follow numbers and lower case then comment out the line char_sets=[[''.join(all_chars)] * 7] and use the line char_sets=list(combinations_with_replacement(all_chars, 7)) .

Sample output for the above command line with char_sets=list(combinations_with_replacement(all_chars, 7)) :

0 01234ZA
1 01234ZB
2 01234ZC
3 01234ZD
4 01234ZE
5 01234ZF
6 01234ZG
7 01234ZH
8 01234ZI
9 01234ZJ

Sample output for the same command line with char_sets=[[''.join(all_chars)] * 7] :

0 01234Z0
1 01234Z1
2 01234Z2
3 01234Z3
4 01234Z4
5 01234Z5
6 01234Z6
7 01234Z7
8 01234Z8
9 01234Z9

It's possible to implement the above without recursion as below. Performance characteristics don't change much:

from string import digits, ascii_uppercase, ascii_lowercase
from sys import argv
from itertools import combinations_with_replacement, product, izip_longest

all_chars=[digits, ascii_lowercase, ascii_uppercase]

def seq(char_sets, start=''):
    for char_set in char_sets:
        for val in seqperm(char_set, start):
            yield val

def seqperm(char_set, start=''):
    iters=[iter(chars) for chars in char_set]
    # move to starting point in sequence if specified
    for char, citer, chars in zip(list(start), iters, char_set):
        try:
            for _ in range(0, chars.index(char)):
                citer.next()
        except ValueError:
            raise StopIteration
    pos=0
    val=''
    while True:
        citer=iters[pos]
        try:
            char=citer.next()
            if val and val[-1] == char:
                char=citer.next()
            if pos == len(char_set) - 1:
                yield val+char
            else:
                val = val + char
                pos += 1
        except StopIteration:
            if pos == 0:
                raise StopIteration
            iters[pos] = iter(chars)
            pos -= 1
            val=val[:pos]

if __name__ == "__main__":
    count=int(argv[1])
    start=''
    if len(argv) == 3:
        start=argv[2]
    # char_sets=list(combinations_with_replacement(all_chars, 7))
    char_sets=[[''.join(all_chars)] * 7]
    for idx, val in enumerate(seq(char_sets, start)):
        if idx == count:
            break
        print idx, val

A recursive version with caching is also possible and that generates results faster but is less flexible.

with a similar approach of @julian

from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product, tee, chain, izip, imap

def flatten(listOfLists):
    "Flatten one level of nesting"
    #recipe of itertools
    return chain.from_iterable(listOfLists)

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    #recipe of itertools
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def eq_pair(x):
    return x[0]==x[1]

def comb_noNN(alfa,size):
    if size>0:
        for candidato in product(alfa,repeat=size):
            if not any( imap(eq_pair,pairwise(candidato)) ):
                yield candidato
    else:
        yield tuple()

def my_string(N=7):
    for a in range(N+1):
        for b in range(N-a+1):
            for c in range(N-a-b+1):
                if sum([a,b,c])==N:
                    for letras in product(
                            comb_noNN(digits,c),
                            comb_noNN(ascii_lowercase,b),
                            comb_noNN(ascii_uppercase,a)
                            ):
                        yield "".join(flatten(letras))

comb_noNN generate all combinations of char of a particular size that follow rule 3, then in my_string check all combination of length that add up to N and generate all string that follow rule 1 by individually generating each of digits, lower and upper case letters.

Some output of for i,x in enumerate(my_string())

0, '0101010'
...
100, '0101231'
...
491041580, '936gzrf'
...
758790032, '27ktxfi' 
...

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM