
How to generate this custom alpha-numeric sequence?

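I would like to create a program that generates a specific kind of 7-character string.

It must follow these rules:

  1. 0-9 come before a-z, which come before A-Z.

  2. The length is 7 characters.

  3. Each character must differ from its neighbours (for example, 'NN' is not allowed).

  4. I need all possible combinations in increasing order from 0000000 to ZZZZZZZ, not in a random sequence.

I have already done it with this code: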

from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product

chars = digits + ascii_lowercase + ascii_uppercase

for n in range(7, 8):
    for comb in product(chars, repeat=n):
        if (comb[6] != comb[5] and comb[5] != comb[4] and comb[4] != comb[3] and comb[3] != comb[2] and comb[2] != comb[1] and comb[1] != comb[0]):
            print ''.join(comb)

But it is not efficient at all, because I have to wait a long time for the next combination.

Can someone help me?

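Edit: I have updated the solution to use cached short sequences for lengths greater than 4. This speeds up the computation considerably: generating all sequences of length 7 would take 18.5 hours with the simple version, but only 4.5 hours with the new method.

I'll let the docstring do all of the talking for describing the solution.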

"""
Problem:
    Generate a string of N characters that only contains alphanumerical
    characters. The following restrictions apply:
        * 0-9 must come before a-z, which must come before A-Z
        * it's valid to not have any digits or letters in a sequence
        * no neighbouring characters can be the same
        * the sequences must be in an order as if the string is base62, e.g.,
          01010...01019, 0101a...0101z, 0101A...0101Z, 01020...etc

Solution:
    Implement a recursive approach which discards invalid trees. For example,
    for "---" start with "0--" and recurse. Try "00-", but discard it for
    "01-". The first and last sequences would then be "010" and "ZYZ".

    If the previous character in the sequence is a lowercase letter, such as
    in "02f-", shrink the pool of available characters to a-zA-Z. Similarly,
    for "9gB-", we should only be working with A-Z.

    The input also allows to define a specific sequence to start from. For
    example, for "abGH", each character will have access to a limited set of
    its pool. In this case, the last letter can iterate from H to Z, at which
    point it'll be free to iterate its whole character pool next time around.

    When specifying a starting sequence, if it doesn't have enough characters
    compared to `length`, it will be padded to the right with characters free
    to explore their character pool. For example, for length 4, the starting
    sequence "29" will be transformed to "29  ", where we will deal with two
    restricted characters temporarily.

    For long lengths the function internally calls a routine which relies on
    fewer recursions and cached results. Length 4 has been chosen as optimal
    in terms of precomputing time and memory demands. Briefly, the sequence is
    broken into a remainder and chunks of 4. For each preceding valid
    subsequence, all valid following subsequences are fetched. For example, a
    sequence of six would be split into "--|----" and for "fB|----" all
    subsequences of 4 starting A, C, D, etc would be produced.

Examples:
    >>> for i, x in enumerate(generate_sequences(7)):
    ...    print i, x
    0 0101010
    1 0101012
    etc

    >>> for i, x in enumerate(generate_sequences(8, '012abcAB')):
    ...    print i, x
    0 012abcAB
    1 012abcAC
    etc

    >>> for i, x in enumerate(generate_sequences(7, 'aB')):
    ...    print i, x
    0 aBABABA
    1 aBABABC
    etc
"""

import string

ALLOWED_CHARS = (string.digits + string.ascii_letters,
                 string.ascii_letters,
                 string.ascii_uppercase,
                 )
CACHE_LEN = 4

def _generate_sequences(length, sequence, previous=''):
    char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
    if sequence[-length] != ' ':
        char_set = char_set[char_set.find(sequence[-length]):]
        sequence[-length] = ' '
    char_set = char_set.replace(previous, '')

    if length == 1:
        for char in char_set:
            yield char
    else:
        for char in char_set:
            for seq in _generate_sequences(length-1, sequence, char):
                yield char + seq

def _generate_sequences_cache(length, sequence, cache, previous=''):
    sublength = length if length == CACHE_LEN else min(CACHE_LEN, length-CACHE_LEN)
    subseq = cache[sublength != CACHE_LEN]
    char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
    if sequence[-length] != ' ':
        char_set = char_set[char_set.find(sequence[-length]):]
        index = len(sequence) - length
        subseq0 = ''.join(sequence[index:index+sublength]).strip()
        sequence[index:index+sublength] = [' '] * sublength
        if len(subseq0) > 1:
            subseq[char_set[0]] = tuple(
                    s for s in subseq[char_set[0]] if s.startswith(subseq0))
    char_set = char_set.replace(previous, '')

    if length == CACHE_LEN:
        for char in char_set:
            for seq in subseq[char]:
                yield seq
    else:
        for char in char_set:
            for seq1 in subseq[char]:
                for seq2 in _generate_sequences_cache(
                                length-sublength, sequence, cache, seq1[-1]):
                    yield seq1 + seq2

def precompute(length):
    char_set = ALLOWED_CHARS[0]
    if length > 1:
        sequence = [' '] * length
        result = {}
        for char in char_set:
            result[char] = tuple(char + seq for seq in  _generate_sequences(
                                                     length-1, sequence, char))
    else:
        result = {char: tuple(char) for char in ALLOWED_CHARS[0]}
    return result

def generate_sequences(length, sequence=''):
    # -------------------------------------------------------------------------
    # Error checking: consistency of the value/type of the arguments
    if not isinstance(length, int):
        msg = 'The sequence length must be an integer: {}'
        raise TypeError(msg.format(type(length)))
    if length < 0:
        msg = 'The sequence length must be greater than or equal to 0: {}'
        raise ValueError(msg.format(length))
    if not isinstance(sequence, str):
        msg = 'The sequence must be a string: {}'
        raise TypeError(msg.format(type(sequence)))
    if len(sequence) > length:
        msg = 'The sequence has length greater than {}'
        raise ValueError(msg.format(length))
    # -------------------------------------------------------------------------
    if not length:
        yield ''
    else:
        # ---------------------------------------------------------------------
        # Error checking: the starting sequence, if provided, must be valid
        if any(s not in ALLOWED_CHARS[0]+' ' for s in sequence):
            msg = 'The sequence contains invalid characters: {}'
            raise ValueError(msg.format(sequence))
        if sequence.strip() != sequence.replace(' ', ''):
            msg = 'Uninitiated characters in the middle of the sequence: {}'
            raise ValueError(msg.format(sequence.strip()))
        sequence = sequence.strip()
        if any(a == b for a, b in zip(sequence[:-1], sequence[1:])):
            msg = 'No neighbours must be the same character: {}'
            raise ValueError(msg.format(sequence))
        char_type = [s.isalpha() * (2 - s.islower()) for s in sequence]
        if char_type != sorted(char_type):
            msg = '0-9 must come before a-z, which must come before A-Z: {}'
            raise ValueError(msg.format(sequence))
        # ---------------------------------------------------------------------
        sequence = list(sequence.ljust(length))
        if length <= CACHE_LEN:
            for s in _generate_sequences(length, sequence):
                yield s
        else:
            remainder = length % CACHE_LEN
            if not remainder:
                cache = tuple((precompute(CACHE_LEN),))
            else:
                cache = tuple((precompute(CACHE_LEN), precompute(remainder)))
            for s in _generate_sequences_cache(length, sequence, cache):
                yield s

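I've included thorough error checks in the generate_sequences() function. For the sake of brevity you can remove them if you can guarantee that whoever calls the function will never pass invalid input, in particular an invalid starting sequence.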
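Stripped of the starting-sequence handling and the cache, the pruning idea from the docstring fits in a few lines. The following is only a minimal Python 3 sketch of that idea, not the code above; the names POOLS, pool_for and sequences are made up for illustration.

```python
import string
from itertools import islice

# Character pools, indexed by the class of the previous character
# (assumed layout: 0 = start or digit, 1 = lowercase, 2 = uppercase).
POOLS = (
    string.digits + string.ascii_lowercase + string.ascii_uppercase,
    string.ascii_lowercase + string.ascii_uppercase,
    string.ascii_uppercase,
)

def pool_for(previous):
    """Return the characters that may follow `previous` ('' at the start)."""
    if previous.isupper():
        return POOLS[2]
    if previous.islower():
        return POOLS[1]
    return POOLS[0]

def sequences(length, previous=''):
    """Yield valid sequences in order, never building an invalid prefix."""
    if length == 0:
        yield ''
        return
    for char in pool_for(previous):
        if char == previous:      # rule 3: no equal neighbours
            continue
        for tail in sequences(length - 1, char):
            yield char + tail

print(list(islice(sequences(7), 2)))  # ['0101010', '0101012']
```

Because every branch that is explored leads to valid output, the first result appears immediately instead of after a long run of rejected candidates.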

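Counting the number of sequences of a specific length

While the function generates the sequences one after another, a simple combinatorial calculation tells us how many valid sequences exist in total.

A sequence can effectively be broken down into 3 separate subsequences. Generally speaking, it can contain anything from 0 to 7 digits, followed by 0 to 7 lowercase letters, followed by 0 to 7 uppercase letters, as long as the three lengths add up to 7. This means we can have the partition (1, 3, 3), or (2, 1, 3), or (6, 0, 1), etc. We can use stars and bars to enumerate the ways of splitting a sum of N into k bins. There is already an implementation for Python, which we'll borrow. The first few partitions are: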

[0, 0, 7]
[0, 1, 6]
[0, 2, 5]
[0, 3, 4]
[0, 4, 3]
[0, 5, 2]
[0, 6, 1]
...
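For readers on Python 3, the same enumeration could be sketched as follows. The function name split_sum is an assumption chosen for this example; the approach is the usual stars-and-bars trick of choosing k-1 bar positions among n+k-1 slots.

```python
from itertools import combinations
from math import comb

def split_sum(n, k):
    """Yield every way to write n as an ordered sum of k non-negative integers."""
    for bars in combinations(range(n + k - 1), k - 1):
        # Sentinel edges on both sides; the gaps between edges are the bin sizes.
        edges = (-1,) + bars + (n + k - 1,)
        yield [right - left - 1 for left, right in zip(edges, edges[1:])]

parts = list(split_sum(7, 3))
print(parts[:3])                           # [[0, 0, 7], [0, 1, 6], [0, 2, 5]]
print(len(parts), comb(7 + 3 - 1, 3 - 1))  # 36 36
```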

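Next, we need to calculate how many valid sequences there are within a partition. Since the digit subsequence is independent of the lowercase subsequence, which in turn is independent of the uppercase subsequence, we can count each one individually and multiply the results.

So, how many digit combinations of length 4 can we have? The first character can be any of the 10 digits, but the second character has only 9 options (ten minus the one used by the previous character). The same goes for the third character, and so on. The total number of valid subsequences is therefore 10 * 9 * 9 * 9. Similarly, for letters of length 3 we get 26 * 25 * 25. Overall, for the partition (2, 3, 2), say, we have 10 * 9 * 26 * 25 * 25 * 26 * 25 = 950625000 combinations.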

import itertools as it

def partitions(n, k):
    for c in it.combinations(xrange(n+k-1), k-1):
        yield [b-a-1 for a, b in zip((-1,)+c, c+(n+k-1,))]

def count_subsequences(pool, length):
    if length < 2:
        return pool**length
    return pool * (pool-1)**(length-1)

def count_sequences(length):
    counts = [[count_subsequences(i, j) for j in xrange(length+1)] \
              for i in [10, 26]]

    print 'Partition {:>18}'.format('Sequence count')

    total = 0
    for a, b, c in partitions(length, 3):
        subtotal = counts[0][a] * counts[1][b] * counts[1][c]
        total += subtotal
        print '{} {:18}'.format((a, b, c), subtotal)
    print '\nTOTAL {:22}'.format(total)

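Overall, we observe that while generating each sequence quickly isn't a problem, there are so many of them that the full run can still take a long time. Length 7 has 78550354750 (78.5 billion) valid sequences, and this number grows by a factor of roughly 25 with each additional character.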
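The totals quoted here can be cross-checked with a short Python 3 sketch. It uses plain nested loops over the three segment lengths; count_block and total_sequences are hypothetical names introduced only for this check.

```python
def count_block(pool, length):
    """Strings of `length` over `pool` symbols with no two equal neighbours."""
    if length == 0:
        return 1
    return pool * (pool - 1) ** (length - 1)

def total_sequences(length):
    """Sum over every (digits, lowercase, uppercase) split of `length`."""
    total = 0
    for a in range(length + 1):
        for b in range(length - a + 1):
            c = length - a - b
            total += count_block(10, a) * count_block(26, b) * count_block(26, c)
    return total

# The (2, 3, 2) partition worked out in the text
print(count_block(10, 2) * count_block(26, 3) * count_block(26, 2))  # 950625000
print(total_sequences(7))                                            # 78550354750
```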

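Edge cases are not handled here (for example, len1 == 7 makes the second randint call fail), but it can be done this way: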

import random
from string import digits, ascii_uppercase, ascii_lowercase

len1 = random.randint(1, 7)
len2 = random.randint(1, 7-len1)
len3 = 7 - len1 - len2
print len1, len2, len3
result = ''.join(random.sample(digits, len1) + random.sample(ascii_lowercase, len2) + random.sample(ascii_uppercase, len3))

Try this:

import string
import random

a = ''.join(random.choice(string.ascii_lowercase + string.ascii_uppercase + string.digits) for _ in range(7))
print(a)

If what you want is a random string that adheres to the above rules, you can use something like this:

import random
from string import digits, ascii_lowercase, ascii_uppercase

def f():
  digitLen = random.randrange(8)
  smallCharLen = random.randint(0, 7 - digitLen)
  capCharLen = 7 - (smallCharLen + digitLen)
  # pick the digits one by one so that digitLen == 0 gives an empty segment
  print ("".join([random.choice(digits) for i in range(digitLen)]) +
      "".join([random.choice(ascii_lowercase) for i in range(smallCharLen)]) +
      "".join([random.choice(ascii_uppercase) for i in range(capCharLen)]))

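I haven't added the repeated-character rule, but once you have a string it is easy to filter out the unwanted ones using dictionaries. You can also fix the length of each segment by adding conditions on the segment lengths.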
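One way such a filter could look is sketched below in Python 3. It is an assumption about what the filtering step might be, and it uses a plain pairwise comparison rather than dictionaries; char_class and is_valid are hypothetical helper names.

```python
from string import digits, ascii_lowercase, ascii_uppercase

def char_class(char):
    """Return 0 for a digit, 1 for a lowercase letter, 2 for an uppercase letter."""
    for rank, pool in enumerate((digits, ascii_lowercase, ascii_uppercase)):
        if char in pool:
            return rank
    raise ValueError('not an allowed character: {!r}'.format(char))

def is_valid(candidate):
    """Check rule 1 (class order) and rule 3 (no equal neighbours)."""
    classes = [char_class(c) for c in candidate]
    in_order = classes == sorted(classes)
    no_repeats = all(a != b for a, b in zip(candidate, candidate[1:]))
    return in_order and no_repeats

print(is_valid('012abAB'))  # True
print(is_valid('01aaBCD'))  # False, 'aa' repeats
print(is_valid('0a1bCDE'))  # False, a digit follows a letter
```

A random generator can then simply draw again until is_valid returns True.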

Edit: fixed a small bug.

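The reason the original implementation takes so long to produce its first result is that product starts at 0000000, and it takes a long time to get from there to the first valid value, 0101010.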
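To put a number on that wait, the position of a candidate within the product ordering can be computed by reading it as a base-62 number. This is a small Python 3 sketch; product_index is a hypothetical helper, and chars is built the same way as in the question.

```python
from string import digits, ascii_lowercase, ascii_uppercase

chars = digits + ascii_lowercase + ascii_uppercase  # same ordering as the question

def product_index(candidate, alphabet=chars):
    """Position of `candidate` in itertools.product(alphabet, repeat=len(candidate))."""
    index = 0
    for char in candidate:
        index = index * len(alphabet) + alphabet.index(char)
    return index

# Number of rejected candidates before the first valid one is printed
print(product_index('0101010'))  # 916371222
```

So the filter-based loop has to examine over 900 million tuples before it prints anything.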

Here's a recursive version which generates valid sequences rather than discarding invalid ones:

from string import digits, ascii_uppercase, ascii_lowercase
from sys import argv
from itertools import combinations_with_replacement, product

all_chars=[digits, ascii_lowercase, ascii_uppercase]

def seq(char_sets, start=None):
    for char_set in char_sets:
        for val in seqperm(char_set, start):
            yield val

def seqperm(char_set, start=None, exclude=None):
    left_chars, remaining_chars=char_set[0], char_set[1:]
    if start:
        try:
            left_chars=left_chars[left_chars.index(start[0]):]
            start=start[1:]
        except:
            left_chars=''
    for left in left_chars:
        if left != exclude:
            if len(remaining_chars) > 0:
                for right in seqperm(remaining_chars, start, left):
                    yield left + right
            else:
                yield left

if __name__ == "__main__":
    count=int(argv[1])
    start=None
    if len(argv) == 3:
        start=argv[2]
    # char_sets=list(combinations_with_replacement(all_chars, 7))
    char_sets=[[''.join(all_chars)] * 7]
    for idx, val in enumerate(seq(char_sets, start)):
        if idx == count:
            break
        print idx, val

Run it like this:

./permute.py 10 

Output:

0 0101010
1 0101012
2 0101013
3 0101014
4 0101015
5 0101016
6 0101017
7 0101018
8 0101019
9 010101a

If you pass an extra argument, the script skips to the part of the sequence that starts with that argument, like this:

./permute.py 10 01234Z

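If the requirement is to produce only permutations where lowercase letters always follow digits and uppercase letters always follow digits and lowercase letters, comment out the line char_sets=[[''.join(all_chars)] * 7] and use the line char_sets=list(combinations_with_replacement(all_chars, 7)) instead.

Sample output for the above command line with char_sets=list(combinations_with_replacement(all_chars, 7)):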

0 01234ZA
1 01234ZB
2 01234ZC
3 01234ZD
4 01234ZE
5 01234ZF
6 01234ZG
7 01234ZH
8 01234ZI
9 01234ZJ

Sample output for the same command line with char_sets=[[''.join(all_chars)] * 7]:

0 01234Z0
1 01234Z1
2 01234Z2
3 01234Z3
4 01234Z4
5 01234Z5
6 01234Z6
7 01234Z7
8 01234Z8
9 01234Z9
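To see what the combinations_with_replacement option feeds into seq, the three character classes can be replaced by one-letter labels. This Python 3 sketch is purely illustrative; the labels 'd', 'l' and 'U' are stand-ins for digits, ascii_lowercase and ascii_uppercase.

```python
from itertools import combinations_with_replacement

# Stand-ins for digits, ascii_lowercase and ascii_uppercase
classes = ['d', 'l', 'U']

patterns = [''.join(p) for p in combinations_with_replacement(classes, 7)]
print(len(patterns))  # 36
print(patterns[:3])   # ['ddddddd', 'ddddddl', 'ddddddU']
print(patterns[-1])   # UUUUUUU
```

Each pattern fixes the class of every position, which is what enforces the digits, then lowercase, then uppercase order. Since seq walks the patterns one after another, the output with this option is grouped by pattern.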

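The recursive script above can also be implemented without recursion, as shown below. The performance characteristics don't change much: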

from string import digits, ascii_uppercase, ascii_lowercase
from sys import argv
from itertools import combinations_with_replacement, product, izip_longest

all_chars=[digits, ascii_lowercase, ascii_uppercase]

def seq(char_sets, start=''):
    for char_set in char_sets:
        for val in seqperm(char_set, start):
            yield val

def seqperm(char_set, start=''):
    iters=[iter(chars) for chars in char_set]
    # move to starting point in sequence if specified
    for char, citer, chars in zip(list(start), iters, char_set):
        try:
            for _ in range(0, chars.index(char)):
                citer.next()
        except ValueError:
            raise StopIteration
    pos=0
    val=''
    while True:
        citer=iters[pos]
        try:
            char=citer.next()
            if val and val[-1] == char:
                char=citer.next()
            if pos == len(char_set) - 1:
                yield val+char
            else:
                val = val + char
                pos += 1
        except StopIteration:
            if pos == 0:
                raise StopIteration
            iters[pos] = iter(char_set[pos])
            pos -= 1
            val=val[:pos]

if __name__ == "__main__":
    count=int(argv[1])
    start=''
    if len(argv) == 3:
        start=argv[2]
    # char_sets=list(combinations_with_replacement(all_chars, 7))
    char_sets=[[''.join(all_chars)] * 7]
    for idx, val in enumerate(seq(char_sets, start)):
        if idx == count:
            break
        print idx, val

A recursive version with caching is also possible; it generates results faster but is less flexible.

An approach similar to @julian's:

from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product, tee, chain, izip, imap

def flatten(listOfLists):
    "Flatten one level of nesting"
    #recipe of itertools
    return chain.from_iterable(listOfLists)

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    #recipe of itertools
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def eq_pair(x):
    return x[0]==x[1]

def comb_noNN(alfa,size):
    if size>0:
        for candidato in product(alfa,repeat=size):
            if not any( imap(eq_pair,pairwise(candidato)) ):
                yield candidato
    else:
        yield tuple()

def my_string(N=7):
    for a in range(N+1):
        for b in range(N-a+1):
            for c in range(N-a-b+1):
                if sum([a,b,c])==N:
                    for letras in product(
                            comb_noNN(digits,c),
                            comb_noNN(ascii_lowercase,b),
                            comb_noNN(ascii_uppercase,a)
                            ):
                        yield "".join(flatten(letras))

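comb_noNN generates all combinations of characters of a given size that follow rule 3. my_string then goes through every combination of segment lengths that adds up to N and builds all strings that follow rule 1 by generating the digit, lowercase and uppercase segments separately.

Some output of for i,x in enumerate(my_string()):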

0, '0101010'
...
100, '0101231'
...
491041580, '936gzrf'
...
758790032, '27ktxfi' 
...
