
Optimizing a way to find all permutations of a string

I solved a puzzle but need to optimize my solution. The puzzle says that I am to take a string S, find all permutations of its characters, sort my results, and then return the one-based index of where S appears in that list.

For example, the string 'bac' appears in the 3rd position in the sorted list of its own permutations: ['abc', 'acb', 'bac', 'bca', 'cab', 'cba'].

My problem is that the puzzle limits my execution time to 500ms. One of the test cases passes "BOOKKEEPER" as an input, which takes ~4.2s for me to complete.

I took a (possibly naive) dynamic programming approach, memoizing with a dict keyed by one particular permutation of some character set, but that's not enough.

What is my bottleneck?

I'm profiling in the meantime to see if I can answer my own question, but I invite those who see the problem outright to help me understand how I slowed this down.
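One way to look for the bottleneck is Python's built-in cProfile module. The following is only a minimal sketch for Python 3: `brute_force_position` is a hypothetical stand-in workload, not the code from this question, and `profile_call` is a helper name made up for illustration.

```python
import cProfile
import io
import itertools
import pstats


def brute_force_position(word):
    """Stand-in workload (hypothetical): rank word by building every permutation."""
    perms = sorted(set(itertools.permutations(word)))
    return perms.index(tuple(word)) + 1


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(5)  # show the five costliest entries
    return result, report.getvalue()


result, report = profile_call(brute_force_position, "QUESTION")
print(result)
print(report)
```

Swapping in the real function for the stand-in should show which calls dominate the cumulative time.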

EDIT: My solution appears to outperform itertools.permutations (10+ seconds for input "QUESTION"). To be fair, that includes time spent printing, so this might not be a fair comparison. Even so, I'd rather submit a hand-coded solution with competitive performance, knowing why mine was worse, than opt for a module.

memo = {}

def hash(word):
    return ''.join(sorted(word))

def memoize(word, perms):
    memo[hash(word)] = perms
    return perms

def permutations(word, prefix = None):
    """Return list of all possible permutations of given characters"""
    H = hash(word)

    if H in memo:
        return [s if prefix is None else prefix + s for s in memo[H]]

    L = len(word)

    if L == 1:
        return [word] if prefix is None else [prefix + word]

    elif L == 2:
        a = word[0] + word[1]
        b = word[1] + word[0]

        memoize(word, [a, b])

        if prefix is not None:
            a = prefix + a
            b = prefix + b

        return [a, b]

    perms = []
    for i in range(len(word)):
        perms = perms + permutations(word[:i] + word[i+1:], word[i])

    memoize(word, perms)

    return [prefix + s for s in perms] if prefix is not None else perms


def listPosition(word):
  """Return the anagram list position of the word"""
  return sorted(list(set(permutations(word)))).index(word) + 1

print(listPosition('AANZ'))

I believe the answer is to neither produce all the permutations nor sort them. Let's keep it simple and see how it compares performance-wise:

import itertools

def listPosition(string):
    seen = set()

    target = tuple(string)

    count = 1

    for permutation in itertools.permutations(sorted(string)):
        if permutation == target:
            return count
        if permutation not in seen:
            count += 1
            seen.add(permutation)

print(listPosition('BOOKKEEPER'))

TIMINGS (in seconds)

           Sage/Evert  Mine  Sage     Answer
QUESTIONS     0.02     0.18  0.45      98559
BOOKKEEPER    0.03     0.11  2.10      10743
ZYGOTOBLAST   0.03     24.4  117(*)  9914611

(*) includes ~25 second delay between printing of answer and program completion

The output from Sci Prog's code did not agree with the other two: it produced larger indexes, and several of them per input, so I didn't include its timings, which were lengthy.

I'm providing my own answer on the assumption that a good way to optimize code is to not use it in the first place. Since I strongly emphasized identifying ways to speed up the code I posted, I'm upvoting everyone else for having made improvements in that light.

@Evert posted the following comment:

I would think you can come up with a formula to calculate the position of the input word, based on the alphabetic ordering (since the list is sorted alphabetically) of the letters. If I understand the puzzle correctly, it only asks to return the position of the input, not all of the permutations. So you'll want to grab some pen and paper and find a formulation of that problem.

Following this reasoning, and similar suggestions from others, I tried an approach based more on enumerative combinatorics:

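from functools import reduce  # a builtin in Python 2, must be imported in Python 3
from math import factorial as F
from operator import mul

def permutations(s):
    # number of distinct permutations: n! divided by count(c)! for each distinct c
    return F(len(s)) // reduce(mul, [F(s.count(c)) for c in set(s)], 1)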

def omit(s,index):
    return s[:index] + s[index+1:]

def listPosition(s):
    if (len(s) == 1):
        return 1

    firstletter = s[0]
    predecessors = set([c for c in s[1:] if c < firstletter])
    startIndex = sum([permutations(omit(s, s.index(c))) for c in predecessors])

    return startIndex + listPosition(s[1:])

This produced correct output and passed the puzzle at high speed (performance metrics not recorded, but noticeably different). Not a single string permutation was actually produced.

Take as an example input QUESTION:

We know that wherever in the list "QUESTION" appears, it will appear after all permutations that start with letters that come before "Q". The same can be said of substrings down the line.

I find the letters that come before firstletter = 'Q' and store them in predecessors. The set prevents double counting for input with repeated letters.

Then, we assume that each letter in predecessors acts as a prefix. If I omit that prefix from the string and count the permutations of the remaining letters, then sum those counts over all predecessors, I get the number of permutations that must appear before any permutation starting with the input's first letter. Recurse on the rest of the string, then sum the results, and you end up with the position.
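The same reasoning can be written without recursion or string slicing. This is a sketch for Python 3 under my own naming (`rank`, `count_arrangements` and `brute_force_rank` are not from the answer above). It keeps a running letter count and cross-checks the result against brute force on short inputs.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def count_arrangements(counts):
    """Distinct arrangements of a multiset given as a letter -> count mapping."""
    total = factorial(sum(counts.values()))
    for k in counts.values():
        total //= factorial(k)
    return total


def rank(word):
    """One-based position of word among its sorted, distinct permutations."""
    counts = Counter(word)
    position = 1
    for letter in word:
        # Every distinct smaller letter still available could sit here instead;
        # add the number of arrangements of what would remain after it.
        for smaller in sorted(counts):
            if smaller >= letter:
                break
            counts[smaller] -= 1
            position += count_arrangements(counts)
            counts[smaller] += 1
        # Fix the actual letter and move on to the next position.
        counts[letter] -= 1
        if counts[letter] == 0:
            del counts[letter]
    return position


def brute_force_rank(word):
    """Reference implementation, usable for short inputs only."""
    return sorted(set(permutations(word))).index(tuple(word)) + 1


for w in ("bac", "AANZ", "QUESTION"):
    assert rank(w) == brute_force_rank(w)

print(rank("bac"))         # 3, as in the example from the question
print(rank("BOOKKEEPER"))  # 10743, matching the Answer column in the timings table
```

For an input of n letters this evaluates at most n × (number of distinct letters) multinomial counts, rather than enumerating up to n! strings.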

Your bottleneck resides in the fact that the number of permutations of a list of N items is N! (N factorial). This number grows very fast as the input increases.

The first optimisation you can do is that you do not have to store all the permutations. Here is a recursive solution that produces all the permutations already sorted. The "trick" is to sort the letters of the word before generating the permutations.
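To get a feel for that growth, here is a short Python 3 sketch (the helper name `distinct_permutations` is mine) that prints N! next to the number of distinct permutations for the words mentioned in this thread:

```python
from collections import Counter
from math import factorial


def distinct_permutations(word):
    """n! divided by k! for every letter that occurs k times."""
    total = factorial(len(word))
    for k in Counter(word).values():
        total //= factorial(k)
    return total


for word in ("ABC", "QUESTION", "QUESTIONS", "BOOKKEEPER", "ZYGOTOBLAST"):
    print(word, factorial(len(word)), distinct_permutations(word))
```

"BOOKKEEPER", for instance, has 3,628,800 orderings of its ten letters but only 151,200 distinct ones, so any approach that enumerates orderings does a lot of redundant work.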

def permutations_sorted( list_chars ):
  if len(list_chars) == 1:  # only one permutation for a 1-character string     
    yield list_chars
  elif len(list_chars) > 1:
    list_chars.sort()
    for i in range(len(list_chars)):
      # use each character as first position (i=index)                          
      head_char = None
      tail_list = []
      for j,c in enumerate(list_chars):
        if i==j:
          head_char = c
        else:
          tail_list.append(c)
      # recursive call, find all permutations of remaining                      
      for tail_perm in permutations_sorted(tail_list):
        yield [ head_char ] + tail_perm

def puzzle( s ):
  print("puzzle %s" % s)
  results = []
  for i,p_list in enumerate(permutations_sorted(list(s))):
    p_str = "".join(p_list)
    if p_str == s:
      results.append( i+1 )
  # print() with a single argument behaves the same on Python 2 and 3
  print("string %s was seen at position%s %s" % (
    s,
    "s" if len(results) > 1 else "",
    ",".join(["%d" % i for i in results])
  ))
  print("")


if __name__ == '__main__':
  puzzle("ABC")

Still, that program takes a long time to run when the input is large. On my computer (2.5 GHz Intel Core i5):

  • Input = "ABC" (3 characters): 0.03 seconds
  • Input = "QUESTION" (8 characters): 0.329 seconds
  • Input = "QUESTIONS" (9 characters): 2.848 seconds
  • Input = "BOOKKEEPER" (10 characters): 30.47 seconds

The only way to "beat the clock" is to figure out a way to compute the position of the string without generating all the permutations.

See the comment by Evert above.

NB: When the input contains repeated letters, the initial string is seen at more than one place. I assume you have to report only the first occurrence.
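To see how repeated letters affect the result, this small Python 3 sketch (function names are hypothetical) compares the positions at which a word shows up when repeated letters are treated as distinct with its position in the de-duplicated, sorted list. The second reading is the one the question's own listPosition uses, since it applies set() before sorting.

```python
from itertools import permutations


def raw_positions(word):
    """One-based positions of word when repeated letters are treated as distinct."""
    target = tuple(word)
    return [i for i, p in enumerate(permutations(sorted(word)), start=1)
            if p == target]


def distinct_position(word):
    """One-based position of word once duplicates are removed and the list sorted."""
    return sorted(set(permutations(word))).index(tuple(word)) + 1


print(raw_positions("baa"))      # [5, 6]
print(distinct_position("baa"))  # 3
```

Even the first raw occurrence (5) differs from the de-duplicated position (3), which is consistent with the larger indexes reported in the timings discussion.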
