Python RegEx - Hangman algorithm

Question

I am trying to write a hangman algorithm. My idea for it goes like this:

Pre-process a dictionary that contains the relative letter frequencies of words depending on their length. This step is complete.

Example:

# Each key corresponds to the length of the word.
frequencyDict = {2: ['a', 'o', 'e', 'i', 'm', 'h', 'n', 'u', 's', 't', 'y', 'b', 'd', 'l', 'p', 'x', 'f', 'r', 'w', 'g', 'k', 'j'], 
  3: ['a', 'e', 'o', 'i', 't', 's', 'u', 'p', 'r', 'n', 'd', 'b', 'm', 'g', 'y', 'l', 'h', 'w', 'f', 'c', 'k', 'x', 'v', 'j', 'z', 'q'], 
  4: ['e', 'a', 's', 'o', 'i', 'l', 'r', 't', 'n', 'u', 'd', 'p', 'm', 'h', 'b', 'c', 'g', 'k', 'y', 'f', 'w', 'v', 'j', 'z', 'x', 'q'],
  5: ['s', 'e', 'a', 'o', 'r', 'i', 'l', 't', 'n', 'd', 'u', 'c', 'p', 'y', 'm', 'h', 'g', 'b', 'k', 'f', 'w', 'v', 'z', 'x', 'j', 'q'],
  6: ['e', 's', 'a', 'r', 'i', 'o', 'l', 'n', 't', 'd', 'u', 'c', 'p', 'm', 'g', 'h', 'b', 'y', 'f', 'k', 'w', 'v', 'z', 'x', 'j', 'q'],
  7: ['e', 's', 'a', 'i', 'r', 'n', 'o', 't', 'l', 'd', 'u', 'c', 'g', 'p', 'm', 'h', 'b', 'y', 'f', 'k', 'w', 'v', 'z', 'x', 'j', 'q'],
  8: ['e', 's', 'i', 'a', 'r', 'n', 'o', 't', 'l', 'd', 'c', 'u', 'g', 'p', 'm', 'h', 'b', 'y', 'f', 'k', 'w', 'v', 'z', 'x', 'q', 'j']}

I also have a generator of words in a dictionary:

dictionary = word_reader('C:\\Python27\\dictionary.txt', len(letters))

Which is based on this function:

#Strips dictionary of words that are too big or too small from the list
def word_reader(filename, L):
  L2 = L+2
  return (word.strip() for word in open(filename) \
          if len(word) < L2 and len(word) > 2)

This particular game will give you the last vowel for free. If the word was earthen, for example, the user would be given the following board to guess: e----e-. So, I want to find a way to create a new generator or list with all the words stripped out of it that do not conform to the e----e- template.

p = re.compile('^e\D\D\D\De\D$', re.IGNORECASE) will do it, but it might find words that contain 'e's in other places besides the first letter and the second-to-last letter.

So my first questions are:

How do I ensure that an 'e' is located ONLY in the first and the second-to-last position?
How do I create an intelligent function that will have a new regex as the puzzle updates and the computer keeps making its guesses?

For example, if the word is monkey, the computer would just be given ----e-. The first step would be for it to strip from its dictionary all words that are not 6 letters long, and all words that do not conform perfectly to the '----e-' template, and put the rest in a newList. How do I go about doing this?

It then computes a NEW frequencyDict based on the relative frequency of letters in the words that are in its newList.

My current method of doing this looks like this:

   cnt = Counter()
   for words in dictionary:
      for letters in words:
         cnt[letters]+=1

Is this the most efficient way?

It would then use the new frequencyDict to guess the most common letter, assuming it has not already been guessed. It continues to do this until (hopefully) the word is guessed.

Is this an efficient algorithm? Are there better implementations?

Answer 1

That's quite a lot of questions. I'll try to answer a few.

Your regex should look more like this: '^e[^e][^e][^e][^e]e[^e]$'. Those [^e] bits say "match any character that is not 'e'". Note that unlike your regex, this will match non-letter characters, but that shouldn't be a problem if you make sure your dictionary has only letters. Note that once you have uncovered more than one letter, you would put all the letters into each of those "don't match" sections. For example, say that the 'a' is guessed, so the board is "ea---e-"; now you will match with the regex '^ea[^ae][^ae][^ae]e[^ae]$'.
You could simply write a function that takes a string such as "ea---e-" and builds a regex from it. It would simply need to a) find all of the non-hyphen letters in the string, as a set (in this case, {'a', 'e'}), b) flatten the set into a "match-all-but-this" regex fragment ([^ae]); the order is not important, which is why I used a set, c) substitute each hyphen with one of those (ea[^ae][^ae][^ae]e[^ae]), and d) finally just put a '^' at the front and a '$' at the end.
Lastly, the frequency dict: that is a very separate question. It's hard to get more efficient than a linear search through the whole dictionary. One suggestion I would make is that you possibly shouldn't be counting letters multiple times. For example, do you want the word "earthen" to contribute 2 points towards the letter count for 'e'? I would guess that in Hangman you only want it to count once, since the word "eeeeeeee" and the word "the" both have the same outcome for guessing the letter 'e' (success). But I could be wrong.
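
Steps a) to d) above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the names template_to_regex and filter_words are made up for this example, and it assumes the board uses '-' for blanks and that every word is already stripped of whitespace and contains only letters.

```python
import re

def template_to_regex(template, blank='-'):
    """Build a compiled regex from a board such as 'ea---e-' (hypothetical helper)."""
    board = template.lower()
    # a) collect the letters already revealed on the board
    revealed = set(ch for ch in board if ch != blank)
    # b) a blank may hold anything except a revealed letter
    #    (assumes the revealed characters are plain letters)
    if revealed:
        blank_pattern = '[^' + ''.join(sorted(revealed)) + ']'
    else:
        blank_pattern = '.'
    # c) substitute each blank, keeping revealed letters as literals
    parts = [blank_pattern if ch == blank else re.escape(ch) for ch in board]
    # d) anchor the pattern at both ends
    return re.compile('^' + ''.join(parts) + '$', re.IGNORECASE)

def filter_words(words, template):
    """Keep only the words that still fit the board (hypothetical helper)."""
    pattern = template_to_regex(template)
    return [word for word in words if pattern.match(word)]
```

For the board "ea---e-" this produces the pattern '^ea[^ae][^ae][^ae]e[^ae]$', so a word like "eagerer" is rejected because it has an 'e' in a position that is still blank.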

Answer 2

There's nothing particularly magical about regexes, and matching them against your whole dictionary is still going to take O(n) time. I'd recommend writing your own function that determines if a word is a match for a template, and running your dictionary-so-far through that.

Here's an example function:

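def matches_template(word, template):
  # A word of the wrong length can never fit the board.
  if len(word) != len(template): return False
  # Letters already revealed on the board.
  found_chars = set(x for x in template if x != '-')
  for char, template_char in zip(word, template):
    if template_char == '-':
      # A blank cannot hide a letter that has already been revealed.
      if char in found_chars: return False
    else:
      # A revealed position must match exactly.
      if template_char != char: return False
  return True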

As far as determining the next character to guess, you probably don't want to select the most frequent character. Instead, you want to select the character that comes closest to being in 50% of words, meaning you eliminate the most possibilities either way. Even that isn't optimal - it could be that certain characters are more likely to occur twice in the word, and therefore eliminate a larger proportion of candidates - but it's closer.
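
A rough sketch of that "closest to 50%" selection might look like the following. The function name best_guess and its parameters are invented for this example; it assumes candidates is a list of the words that still fit the board and guessed is a collection of letters already tried (including any revealed ones). Each letter is counted once per word, so a repeated letter in one word does not inflate its count.

```python
from collections import Counter

def best_guess(candidates, guessed):
    """Return the unguessed letter that appears in closest to half of the
    candidate words, or None if nothing is left to guess (hypothetical helper)."""
    already = set(guessed)
    presence = Counter()
    for word in candidates:
        # set(word) counts a letter once per word, however often it repeats
        presence.update(set(word.lower()) - already)
    if not presence:
        return None
    half = len(candidates) / 2.0
    # sorted() makes ties resolve alphabetically, so the result is deterministic
    return min(sorted(presence), key=lambda letter: abs(presence[letter] - half))
```

For instance, with the candidates monkey, donkey, turkey and hockey and 'e' already revealed, 'k' and 'y' appear in every word and so tell you nothing, while 'n' appears in exactly half of them and is the letter this sketch would pick.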


Answer 1
3 2011-06-14 05:59:12

Answer 2
2 Accepted 2011-06-14 06:29:01
