
How to generate this custom alpha-numeric sequence?

I would like to create a program that generates a particular 7-character string.

It must follow these rules:

  1. 0-9 come before a-z, which come before A-Z

  2. Length is 7 characters.

  3. Each character must differ from its two neighbours (for example, 'NN' is not allowed)

  4. I need all the possible combinations, incrementing from 0000000 to ZZZZZZZ, in order rather than in a random sequence

I have already done it with this code:

from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product

chars = digits + ascii_lowercase + ascii_uppercase

for n in range(7, 8):
    for comb in product(chars, repeat=n):
        if (comb[6] != comb[5] and comb[5] != comb[4] and comb[4] != comb[3] and comb[3] != comb[2] and comb[2] != comb[1] and comb[1] != comb[0]):
            print ''.join(comb)

But it does not perform well at all, because I have to wait a long time before the next combination.

Can someone help me?

Edit: I've updated the solution to use cached short sequences for lengths greater than 4. This significantly speeds up the calculations. With the simple version, it would take 18.5 hours to generate all sequences of length 7, but with the new method only 4.5 hours.

I'll let the docstring do all of the talking for describing the solution.

"""
Problem:
    Generate a string of N characters that only contains alphanumerical
    characters. The following restrictions apply:
        * 0-9 must come before a-z, which must come before A-Z
        * it's valid to not have any digits or letters in a sequence
        * no neighbouring characters can be the same
        * the sequences must be in an order as if the string is base62, e.g.,
          01010...01019, 0101a...0101z, 0101A...0101Z, 01020...etc

Solution:
    Implement a recursive approach which discards invalid trees. For example,
    for "---" start with "0--" and recurse. Try "00-", but discard it for
    "01-". The first and last sequences would then be "010" and "ZYZ".

    If the previous character in the sequence is a lowercase letter, such as
    in "02f-", shrink the pool of available characters to a-zA-Z. Similarly,
    for "9gB-", we should only be working with A-Z.

    The input also allows to define a specific sequence to start from. For
    example, for "abGH", each character will have access to a limited set of
    its pool. In this case, the last letter can iterate from H to Z, at which
    point it'll be free to iterate its whole character pool next time around.

    When specifying a starting sequence, if it doesn't have enough characters
    compared to `length`, it will be padded to the right with characters free
    to explore their character pool. For example, for length 4, the starting
    sequence "29" will be transformed to "29  ", where we will deal with two
    restricted characters temporarily.

    For long lengths the function internally calls a routine which relies on
    fewer recursions and cached results. Length 4 has been chosen as optimal
    in terms of precomputing time and memory demands. Briefly, the sequence is
    broken into a remainder and chunks of 4. For each preceding valid
    subsequence, all valid following subsequences are fetched. For example, a
    sequence of six would be split into "--|----" and for "fB|----" all
    subsequences of 4 starting A, C, D, etc would be produced.

Examples:
    >>> for i, x in enumerate(generate_sequences(7)):
    ...    print i, x
    0, 0101010
    1, 0101012
    etc

    >>> for i, x in enumerate(generate_sequences(8, '012abcAB')):
    ...    print i, x
    0, 012abcAB
    1, 012abcAC
    etc

    >>> for i, x in enumerate(generate_sequences(7, 'aB')):
    ...    print i, x
    0, aBABABA
    1, aBABABC
    etc
"""

import string

ALLOWED_CHARS = (string.digits + string.ascii_letters,
                 string.ascii_letters,
                 string.ascii_uppercase,
                 )
CACHE_LEN = 4

def _generate_sequences(length, sequence, previous=''):
    char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
    if sequence[-length] != ' ':
        char_set = char_set[char_set.find(sequence[-length]):]
        sequence[-length] = ' '
    char_set = char_set.replace(previous, '')

    if length == 1:
        for char in char_set:
            yield char
    else:
        for char in char_set:
            for seq in _generate_sequences(length-1, sequence, char):
                yield char + seq

def _generate_sequences_cache(length, sequence, cache, previous=''):
    sublength = length if length == CACHE_LEN else min(CACHE_LEN, length-CACHE_LEN)
    subseq = cache[sublength != CACHE_LEN]
    char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
    if sequence[-length] != ' ':
        char_set = char_set[char_set.find(sequence[-length]):]
        index = len(sequence) - length
        subseq0 = ''.join(sequence[index:index+sublength]).strip()
        sequence[index:index+sublength] = [' '] * sublength
        if len(subseq0) > 1:
            subseq[char_set[0]] = tuple(
                    s for s in subseq[char_set[0]] if s.startswith(subseq0))
    char_set = char_set.replace(previous, '')

    if length == CACHE_LEN:
        for char in char_set:
            for seq in subseq[char]:
                yield seq
    else:
        for char in char_set:
            for seq1 in subseq[char]:
                for seq2 in _generate_sequences_cache(
                                length-sublength, sequence, cache, seq1[-1]):
                    yield seq1 + seq2

def precompute(length):
    char_set = ALLOWED_CHARS[0]
    if length > 1:
        sequence = [' '] * length
        result = {}
        for char in char_set:
            result[char] = tuple(char + seq for seq in  _generate_sequences(
                                                     length-1, sequence, char))
    else:
        result = {char: tuple(char) for char in ALLOWED_CHARS[0]}
    return result

def generate_sequences(length, sequence=''):
    # -------------------------------------------------------------------------
    # Error checking: consistency of the value/type of the arguments
    if not isinstance(length, int):
        msg = 'The sequence length must be an integer: {}'
        raise TypeError(msg.format(type(length)))
    if length < 0:
        msg = 'The sequence length must be greater or equal than 0: {}'
        raise ValueError(msg.format(length))
    if not isinstance(sequence, str):
        msg = 'The sequence must be a string: {}'
        raise TypeError(msg.format(type(sequence)))
    if len(sequence) > length:
        msg = 'The sequence has length greater than {}'
        raise ValueError(msg.format(length))
    # -------------------------------------------------------------------------
    if not length:
        yield ''
    else:
        # ---------------------------------------------------------------------
        # Error checking: the starting sequence, if provided, must be valid
        if any(s not in ALLOWED_CHARS[0]+' ' for s in sequence):
            msg = 'The sequence contains invalid characters: {}'
            raise ValueError(msg.format(sequence))
        if sequence.strip() != sequence.replace(' ', ''):
            msg = 'Uninitiated characters in the middle of the sequence: {}'
            raise ValueError(msg.format(sequence.strip()))
        sequence = sequence.strip()
        if any(a == b for a, b in zip(sequence[:-1], sequence[1:])):
            msg = 'No neighbours must be the same character: {}'
            raise ValueError(msg.format(sequence))
        char_type = [s.isalpha() * (2 - s.islower()) for s in sequence]
        if char_type != sorted(char_type):
            msg = '0-9 must come before a-z, which must come before A-Z: {}'
            raise ValueError(msg.format(sequence))
        # ---------------------------------------------------------------------
        sequence = list(sequence.ljust(length))
        if length <= CACHE_LEN:
            for s in _generate_sequences(length, sequence):
                yield s
        else:
            remainder = length % CACHE_LEN
            if not remainder:
                cache = tuple((precompute(CACHE_LEN),))
            else:
                cache = tuple((precompute(CACHE_LEN), precompute(remainder)))
            for s in _generate_sequences_cache(length, sequence, cache):
                yield s

I've included thorough error checks in the generate_sequences() function. For the sake of brevity you can remove them if you can guarantee that whoever calls the function will never pass invalid input, specifically an invalid starting sequence.
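As a rough illustration of the pruning idea from the docstring, here is a minimal Python 3 sketch (the code above targets Python 2). The names POOLS, pool_index and sequences are made up for this example, and it deliberately leaves out the caching and the starting-sequence support, so treat it as a sketch rather than a replacement.

```python
import string

# Pools indexed by the class of the previous character:
# 0 = no previous character or a digit, 1 = lowercase, 2 = uppercase.
POOLS = (
    string.digits + string.ascii_letters,
    string.ascii_letters,
    string.ascii_uppercase,
)


def pool_index(previous):
    # '' and digits -> 0, lowercase -> 1, uppercase -> 2
    return previous.isalpha() * (2 - previous.islower())


def sequences(length, previous=''):
    """Yield valid sequences in base-62 order without building invalid ones."""
    if length == 0:
        yield ''
        return
    for char in POOLS[pool_index(previous)]:
        if char == previous:  # neighbouring characters must differ
            continue
        for tail in sequences(length - 1, char):
            yield char + tail


seqs = list(sequences(3))
print(seqs[0], seqs[-1])  # first and last sequence of length 3
```

For length 3 this gives "010" first and "ZYZ" last, which matches what the docstring says.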

Counting the number of sequences of a specific length

While the function will sequentially generate the sequences, there is a simple combinatorics calculation we can perform to compute how many valid sequences exist in total.

The sequences can effectively be broken down into 3 separate subsequences. Generally speaking, a sequence can contain anything from 0 to 7 digits, followed by 0 to 7 lowercase letters, followed by 0 to 7 uppercase letters, as long as those add up to 7. This means we can have the partition (1, 3, 3), or (2, 1, 3), or (6, 0, 1), etc. We can use stars and bars to enumerate the various ways of splitting a sum of N into k bins. There is already an implementation for Python, which we'll borrow. The first few partitions are:

[0, 0, 7]
[0, 1, 6]
[0, 2, 5]
[0, 3, 4]
[0, 4, 3]
[0, 5, 2]
[0, 6, 1]
...
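To make the stars and bars step concrete, here is a small Python 3 sketch of the same idea. The variable names bars and edges are mine; the logic is meant to mirror the borrowed partitions function, just written out with comments.

```python
from itertools import combinations


def partitions(n, k):
    """Yield every way to split n into k ordered, non-negative parts."""
    # Choose k-1 "bar" positions among n+k-1 slots. The gaps between
    # consecutive bars (and the two ends) are the sizes of the parts.
    for bars in combinations(range(n + k - 1), k - 1):
        edges = (-1,) + bars + (n + k - 1,)
        yield [b - a - 1 for a, b in zip(edges, edges[1:])]


parts = list(partitions(7, 3))
print(len(parts))  # 36, i.e. C(9, 2)
print(parts[:3])   # [[0, 0, 7], [0, 1, 6], [0, 2, 5]]
```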

Next, we need to calculate how many valid sequences we have within a partition. Since the digit subsequences are independent of the lowercase letters, which are independent of the uppercase letters, we can calculate them individually and multiply them together.

So, how many digit combinations can we have for a length of 4? The first character can be any of the 10 digits, but the second character has only 9 options (ten minus the one used by the previous character). The same goes for the third character, and so on. So the total number of valid subsequences is 10*9*9*9. Similarly, for 3 letters we get 26*25*25. Overall, for the partition (2, 3, 2), say, we have 10*9*26*25*25*26*25 = 950625000 combinations.
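If you want to convince yourself of that formula, a quick brute-force check is possible for short lengths. This is a Python 3 sketch with made-up helper names (count_closed_form, count_brute_force); it only verifies the arithmetic above and is far too slow for anything long.

```python
from itertools import product
from string import digits, ascii_lowercase


def count_closed_form(pool, length):
    # pool choices for the first character, pool-1 for each one after it
    if length == 0:
        return 1
    return pool * (pool - 1) ** (length - 1)


def count_brute_force(chars, length):
    # Enumerate everything and keep only strings without equal neighbours.
    return sum(
        1
        for combo in product(chars, repeat=length)
        if all(a != b for a, b in zip(combo, combo[1:]))
    )


print(count_closed_form(10, 4), count_brute_force(digits, 4))
print(count_closed_form(26, 3), count_brute_force(ascii_lowercase, 3))

# Partition (2, 3, 2): digits, then lowercase, then uppercase.
print(count_closed_form(10, 2) * count_closed_form(26, 3) * count_closed_form(26, 2))
```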

import itertools as it

def partitions(n, k):
    for c in it.combinations(xrange(n+k-1), k-1):
        yield [b-a-1 for a, b in zip((-1,)+c, c+(n+k-1,))]

def count_subsequences(pool, length):
    if length < 2:
        return pool**length
    return pool * (pool-1)**(length-1)

def count_sequences(length):
    counts = [[count_subsequences(i, j) for j in xrange(length+1)] \
              for i in [10, 26]]

    print 'Partition {:>18}'.format('Sequence count')

    total = 0
    for a, b, c in partitions(length, 3):
        subtotal = counts[0][a] * counts[1][b] * counts[1][c]
        total += subtotal
        print '{} {:18}'.format((a, b, c), subtotal)
    print '\nTOTAL {:22}'.format(total)

Overall, we observe that while generating the sequences fast isn't a problem, there are so many of them that it can still take a long time. Length 7 has 78550354750 (78.5 billion) valid sequences, and this number grows by a factor of roughly 25 to 30 with each additional character.
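For anyone on Python 3, the total can be reproduced with a shorter sketch that returns the number instead of printing a table. The names run_count and count_sequences are my own here, and two nested loops stand in for the partitions helper.

```python
def run_count(pool, length):
    """Number of runs of `length` chars from `pool` symbols, no equal neighbours."""
    return 1 if length == 0 else pool * (pool - 1) ** (length - 1)


def count_sequences(length):
    total = 0
    for a in range(length + 1):          # number of digits
        for b in range(length - a + 1):  # number of lowercase letters
            c = length - a - b           # the rest are uppercase letters
            total += run_count(10, a) * run_count(26, b) * run_count(26, c)
    return total


for n in (6, 7):
    print(n, count_sequences(n))

# Growth factor when going from length 6 to length 7
print(count_sequences(7) / count_sequences(6))
```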

Extreme cases are not handled here, but it can be done this way:

import random
from string import digits, ascii_uppercase, ascii_lowercase

len1 = random.randint(1, 7)
len2 = random.randint(1, 7-len1)
len3 = 7 - len1 - len2
print len1, len2, len3
result = ''.join(random.sample(digits, len1) + random.sample(ascii_lowercase, len2) + random.sample(ascii_uppercase, len3))

Try this:

import string
import random

a = ''.join(random.choice(string.ascii_lowercase + string.ascii_uppercase + string.digits) for _ in range(7))
print(a)

If what you want is a random string that sticks to the above rules, you can use something like this:

import random
from string import ascii_lowercase, ascii_uppercase

def f():
  digitLen = random.randrange(8)
  smallCharLen = random.randint(0, 7 - digitLen)
  capCharLen = 7 - (smallCharLen + digitLen)
  print (str(random.randint(0,10**digitLen-1)).zfill(digitLen) +
      "".join([random.choice(ascii_lowercase) for i in range(smallCharLen)]) +
      "".join([random.choice(ascii_uppercase) for i in range(capCharLen)]))

I haven't added the repeated-character rule, but once you have the string it's easy to filter out the unwanted strings using dictionaries. You can also fix the length of each segment by putting conditions on the segment lengths.

Edit: a minor bug.
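One possible way to build the repeated-character rule in directly, rather than filtering afterwards, is to exclude the previous character each time a new one is picked. This is only a Python 3 sketch: random_run and random_sequence are hypothetical names, and the result is not uniformly distributed over all valid strings.

```python
import random
from string import digits, ascii_lowercase, ascii_uppercase


def random_run(pool, length, rng=random):
    """Pick `length` characters from `pool` with no two equal neighbours."""
    run = []
    for _ in range(length):
        # Leave out the previous character so repeats like 'NN' cannot occur.
        choices = [c for c in pool if not run or c != run[-1]]
        run.append(rng.choice(choices))
    return ''.join(run)


def random_sequence(total=7, rng=random):
    # Split the total length into digits, lowercase and uppercase parts.
    digit_len = rng.randint(0, total)
    lower_len = rng.randint(0, total - digit_len)
    upper_len = total - digit_len - lower_len
    return (random_run(digits, digit_len, rng)
            + random_run(ascii_lowercase, lower_len, rng)
            + random_run(ascii_uppercase, upper_len, rng))


print(random_sequence())
```

The segments never clash at their boundaries because each one draws from a different character class.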

The reason the original implementation takes so long to produce its first result is that product starts from 0000000, and it takes a long time to get from there to the first valid value, 0101010.
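To put a number on that, the position of a candidate in the product ordering is just its value read as a base-62 number. The sketch below (Python 3; product_index is a name I made up) computes how many candidates come before 0101010, which is roughly 916 million, all of them rejected.

```python
from string import digits, ascii_lowercase, ascii_uppercase

# Same character ordering as in the question
chars = digits + ascii_lowercase + ascii_uppercase


def product_index(candidate):
    """Position of `candidate` in product(chars, repeat=len(candidate))."""
    index = 0
    for ch in candidate:
        index = index * len(chars) + chars.index(ch)
    return index


print(product_index('0101010'))  # candidates generated before the first valid one
```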

Here's a recursive version which generates valid sequences rather than discarding invalid ones:

from string import digits, ascii_uppercase, ascii_lowercase
from sys import argv
from itertools import combinations_with_replacement, product

all_chars=[digits, ascii_lowercase, ascii_uppercase]

def seq(char_sets, start=None):
    for char_set in char_sets:
        for val in seqperm(char_set, start):
            yield val

def seqperm(char_set, start=None, exclude=None):
    left_chars, remaining_chars=char_set[0], char_set[1:]
    if start:
        try:
            left_chars=left_chars[left_chars.index(start[0]):]
            start=start[1:]
        except:
            left_chars=''
    for left in left_chars:
        if left != exclude:
            if len(remaining_chars) > 0:
                for right in seqperm(remaining_chars, start, left):
                    yield left + right
            else:
                yield left

if __name__ == "__main__":
    count=int(argv[1])
    start=None
    if len(argv) == 3:
        start=argv[2]
    # char_sets=list(combinations_with_replacement(all_chars, 7))
    char_sets=[[''.join(all_chars)] * 7]
    for idx, val in enumerate(seq(char_sets, start)):
        if idx == count:
            break
        print idx, val

Run as follows:

./permute.py 10

Output:

0 0101010
1 0101012
2 0101013
3 0101014
4 0101015
5 0101016
6 0101017
7 0101018
8 0101019
9 010101a

If you pass an additional argument, the script skips to the portion of the sequence that starts with that argument, like this:

./permute.py 10 01234Z

If the requirement is to generate only sequences where lowercase letters always follow digits, and uppercase letters always follow digits and lowercase letters, then comment out the line char_sets=[[''.join(all_chars)] * 7] and use the line char_sets=list(combinations_with_replacement(all_chars, 7)) instead.

Sample output for the above command line with char_sets=list(combinations_with_replacement(all_chars, 7)):

0 01234ZA
1 01234ZB
2 01234ZC
3 01234ZD
4 01234ZE
5 01234ZF
6 01234ZG
7 01234ZH
8 01234ZI
9 01234ZJ

Sample output for the same command line with char_sets=[[''.join(all_chars)] * 7]:

0 01234Z0
1 01234Z1
2 01234Z2
3 01234Z3
4 01234Z4
5 01234Z5
6 01234Z6
7 01234Z7
8 01234Z8
9 01234Z9
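To see why combinations_with_replacement enforces the digits, then lowercase, then uppercase ordering, it helps to look at what it yields for the three pools. In this Python 3 sketch the strings 'digit', 'lower' and 'upper' are stand-ins for the real character pools, and length 3 is used to keep the listing short.

```python
from itertools import combinations_with_replacement

classes = ['digit', 'lower', 'upper']  # stand-ins for the three pools

# Every pattern is non-decreasing in class order, so a digit can never
# appear after a letter and lowercase can never appear after uppercase.
patterns = list(combinations_with_replacement(classes, 3))
for pattern in patterns:
    print(pattern)

# Number of class patterns for the real length of 7
print(len(list(combinations_with_replacement(classes, 7))))  # 36
```

As far as I can tell from the code, seq() walks through one pattern at a time, so with this option the output is grouped by pattern (all-digit strings first, and so on) rather than following a single base-62 order.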

It's possible to implement the above without recursion, as below. Performance characteristics don't change much:

from string import digits, ascii_uppercase, ascii_lowercase
from sys import argv
from itertools import combinations_with_replacement, product, izip_longest

all_chars=[digits, ascii_lowercase, ascii_uppercase]

def seq(char_sets, start=''):
    for char_set in char_sets:
        for val in seqperm(char_set, start):
            yield val

def seqperm(char_set, start=''):
    iters=[iter(chars) for chars in char_set]
    # move to starting point in sequence if specified
    for char, citer, chars in zip(list(start), iters, char_set):
        try:
            for _ in range(0, chars.index(char)):
                citer.next()
        except ValueError:
            raise StopIteration
    pos=0
    val=''
    while True:
        citer=iters[pos]
        try:
            char=citer.next()
            if val and val[-1] == char:
                char=citer.next()
            if pos == len(char_set) - 1:
                yield val+char
            else:
                val = val + char
                pos += 1
        except StopIteration:
            if pos == 0:
                raise StopIteration
            iters[pos] = iter(char_set[pos])
            pos -= 1
            val=val[:pos]

if __name__ == "__main__":
    count=int(argv[1])
    start=''
    if len(argv) == 3:
        start=argv[2]
    # char_sets=list(combinations_with_replacement(all_chars, 7))
    char_sets=[[''.join(all_chars)] * 7]
    for idx, val in enumerate(seq(char_sets, start)):
        if idx == count:
            break
        print idx, val

A recursive version with caching is also possible; it generates results faster but is less flexible.
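One way such a cached version could look is sketched below in Python 3. This is my own guess at the idea, not code from the original script: tails and generate are hypothetical names, only the no-equal-neighbours rule is enforced (as with the default char_sets), and there is no support for a starting value.

```python
from functools import lru_cache
from itertools import islice
from string import digits, ascii_lowercase, ascii_uppercase

CHARS = digits + ascii_lowercase + ascii_uppercase


@lru_cache(maxsize=None)
def tails(length, previous):
    """All strings of `length` chars, no equal neighbours, not starting with `previous`."""
    if length == 0:
        return ('',)
    return tuple(c + t
                 for c in CHARS if c != previous
                 for t in tails(length - 1, c))


def generate(length, chunk=2, previous=''):
    """Stitch cached chunks together instead of recursing one character at a time."""
    if length <= chunk:
        yield from tails(length, previous)
        return
    for head in tails(chunk, previous):
        for rest in generate(length - chunk, chunk, head[-1]):
            yield head + rest


print(list(islice(generate(7), 3)))  # ['0101010', '0101012', '0101013']
```

The chunk size trades memory for speed: larger chunks mean fewer recursive calls but bigger cached tuples.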

Using an approach similar to @julian's:

from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product, tee, chain, izip, imap

def flatten(listOfLists):
    "Flatten one level of nesting"
    #recipe of itertools
    return chain.from_iterable(listOfLists)

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    #recipe of itertools
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def eq_pair(x):
    return x[0]==x[1]

def comb_noNN(alfa,size):
    if size>0:
        for candidato in product(alfa,repeat=size):
            if not any( imap(eq_pair,pairwise(candidato)) ):
                yield candidato
    else:
        yield tuple()

def my_string(N=7):
    for a in range(N+1):
        for b in range(N-a+1):
            for c in range(N-a-b+1):
                if sum([a,b,c])==N:
                    for letras in product(
                            comb_noNN(digits,c),
                            comb_noNN(ascii_lowercase,b),
                            comb_noNN(ascii_uppercase,a)
                            ):
                        yield "".join(flatten(letras))

comb_noNN generates all combinations of characters of a given size that follow rule 3. my_string then goes through every combination of lengths that add up to N and produces all strings that follow rule 1 by generating the digit, lowercase, and uppercase parts separately.

Some output of for i,x in enumerate(my_string()):

0, '0101010'
...
100, '0101231'
...
491041580, '936gzrf'
...
758790032, '27ktxfi' 
...
