
Finding words from random input letters in Python. What algorithm to use/code already there?

I am trying to code a word descrambler like this one here and was wondering what algorithms I should use to implement this. Also, if anyone can find existing code for this, that would be great as well. Basically the functionality is going to be like a Boggle solver but without the matrix: just searching for all possible words from a string of characters. I already have adequate dictionaries.

I was planning to do this in either Python or Ruby. Thanks in advance for your help, guys!

I'd use a Trie. Here's an implementation in Python: http://jtauber.com/2005/02/trie.py (credit to James Tauber)
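
For readers who have not met the structure, here is a minimal sketch of a trie. It is not the linked implementation; the class name `Trie` and its methods are made up for illustration. Each node maps a character to a child node, and a flag marks nodes that end a complete word.

```python
class Trie:
    """Minimal illustrative trie; the names here are hypothetical."""

    def __init__(self):
        self.children = {}    # maps a character to a child Trie node
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def _walk(self, chars):
        # Follow chars from this node; return None if the path breaks off.
        node = self
        for ch in chars:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for w in ["cat", "cats", "act"]:
    trie.insert(w)
print(trie.contains("cat"), trie.contains("ca"), trie.has_prefix("ca"))
```

The prefix check is what makes a trie attractive here: a search can abandon a partial letter sequence as soon as no stored word starts with it.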

I may be missing an understanding of the game, but barring some complications in the rules (such as "joker" (wildcard) letters, missing or additional letters, multiple words, etc.), I think the following ideas would turn the problem into a relatively uninteresting one. :-(

Main idea: index words by the ordered sequence of their letters.
For example, "computer" gets keyed as "cemoprtu". Whatever the random draw provides is sorted in the same way and used as the key to find possible matches. Using trie structures, as suggested by perimosocordiae, as the underlying storage for these sorted keys, with the associated word(s)/word IDs in the "leaf" nodes, word lookup can be done in O(n) time, where n is the number of letters (or better on average, because lookups for non-existent words fail early).
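
A minimal sketch of the sorted-letter key, using a plain dict in place of a trie to keep it short. The helper names `letter_key` and `build_index` and the sample word list are assumptions made for illustration.

```python
from collections import defaultdict


def letter_key(word):
    """Return the word's letters in sorted order, e.g. 'computer' -> 'cemoprtu'."""
    return "".join(sorted(word.lower()))


def build_index(words):
    """Group words by sorted-letter key, so anagrams land in the same bucket."""
    index = defaultdict(list)
    for word in words:
        index[letter_key(word)].append(word)
    return index


# Hypothetical sample data; a real run would load a dictionary file instead.
sample_words = ["computer", "stressed", "desserts", "listen", "silent"]
index = build_index(sample_words)

print(letter_key("computer"))                # cemoprtu
print(index.get(letter_key("tinsel"), []))   # ['listen', 'silent']
```

A scrambled draw is looked up by sorting it the same way, so the order in which the letters arrive does not matter.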

To further help with indexing, we can have several tables/dictionaries, one per word length. Also, depending on statistics, the vowels and consonants could be handled separately. Another trick would be to use a custom sort order that places the most selective letters first.

Additional twists to the game (such as finding words made from a subset of the letters) are mostly a matter of iterating over the power set of these letters and checking the dictionary for each combination.

A few heuristics can be introduced to help prune some of the combinations (for example, combinations without vowels [and of a given length] are not possible solutions). These heuristics should be managed carefully, because the lookup cost is relatively small: a check that costs more than the lookup it avoids is not worth having.
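
The power-set idea can be sketched as follows, assuming an index keyed by sorted letters. The function names, the `min_len` parameter, and the tiny index are hypothetical. Because the letters are sorted before combining, every combination is already a valid key, and a set removes the duplicates that repeated letters would produce.

```python
from itertools import combinations


def subset_keys(letters, min_len=2):
    """Return every distinct sorted-letter key formed from a subset of letters."""
    ordered = sorted(letters.lower())
    keys = set()
    for size in range(min_len, len(ordered) + 1):
        # combinations() preserves input order, so each combo stays sorted
        for combo in combinations(ordered, size):
            keys.add("".join(combo))
    return keys


def find_words(letters, index, min_len=2):
    """Look up each subset key in an index that maps sorted key -> word list."""
    found = []
    for key in subset_keys(letters, min_len):
        found.extend(index.get(key, []))
    return sorted(found)


# Hypothetical index for illustration only.
tiny_index = {"act": ["act", "cat"], "at": ["at"], "acst": ["cast", "cats", "scat"]}
print(find_words("tacs", tiny_index))
# ['act', 'at', 'cast', 'cat', 'cats', 'scat']
```

A pruning heuristic such as "skip keys with no vowel" would go inside `subset_keys`, but as noted above it only pays off if it is cheaper than the dictionary lookup it saves.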

For your dictionary index, build a map (Map[Bag[Char], List[String]]). It should be a hash map so you can get O(1) word lookup. A Bag[Char] is an identifier for a word that is unique up to character order. It is basically a hash map from Char to Int. The Char is a given character in the word and the Int is the number of times that character appears in the word.

Example:

{'a'=>3, 'n'=>1, 'g'=>1, 'r'=>1, 'm'=>1} => ["anagram"]
{'s'=>3, 't'=>1, 'r'=>1, 'e'=>2, 'd'=>1} => ["stressed", "desserts"]

To find words, take every combination of characters from the input string and look it up in this map. The complexity of this algorithm is O(2^n) in the length of the input string. Notably, the complexity does not depend on the size of the dictionary.
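
In Python, one way to approximate a Bag[Char] is a `frozenset` of `(char, count)` pairs built from `collections.Counter`, which is hashable and so can serve as a dict key. The sketch below makes that assumption; the function names and the sample word list are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def bag(chars):
    """Hashable multiset of characters: a frozenset of (char, count) pairs."""
    return frozenset(Counter(chars).items())


def build_bag_index(words):
    """Map each bag of characters to the list of words that share it."""
    index = defaultdict(list)
    for word in words:
        index[bag(word)].append(word)
    return index


def words_from(letters, index):
    """Try every combination of the input letters: O(2^n) dictionary lookups."""
    seen = set()
    found = []
    for size in range(1, len(letters) + 1):
        for combo in combinations(letters, size):
            key = bag(combo)
            if key in seen:  # repeated letters yield the same bag more than once
                continue
            seen.add(key)
            found.extend(index.get(key, []))
    return found


# Hypothetical word list for illustration only.
bag_index = build_bag_index(["anagram", "stressed", "desserts", "rest"])
print(sorted(words_from("desserts", bag_index)))
# ['desserts', 'rest', 'stressed']
```

The number of lookups depends only on the input length, which matches the point that the dictionary size does not enter the complexity.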

This sounds like Rabin-Karp string search would be a good choice. If you use a rolling hash function, then at each position you need one hash value update and one dictionary lookup. You also need a good way to cope with different word lengths, such as truncating all words to the length of the shortest word in the set and rechecking possible matches. Splitting the word set into separate length ranges will reduce the number of false positives at the expense of more hashing work.

There are two ways to do this. One is to check every candidate permutation of letters in the word to see if the candidate is in your dictionary of words. That's an O(N!) operation, where N is the length of the word.
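
For comparison, the permutation approach might look like the sketch below. The function name and the vocabulary set are assumptions, and shorter permutations are included as well, on the assumption that words using only some of the letters are wanted. It is fine for a handful of letters and quickly becomes impractical beyond that.

```python
from itertools import permutations


def words_by_permutation(letters, vocabulary):
    """Brute force: test every ordering of every subset of the letters."""
    found = set()
    for size in range(1, len(letters) + 1):
        for perm in permutations(letters, size):
            candidate = "".join(perm)
            if candidate in vocabulary:  # set membership is O(1) on average
                found.add(candidate)
    return sorted(found)


# Hypothetical vocabulary for illustration only.
vocabulary = {"cat", "act", "at", "dog"}
print(words_by_permutation("tac", vocabulary))
# ['act', 'at', 'cat']
```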

The other way is to check every candidate word in your dictionary to see if it's contained within the word. This can be sped up by aggregating the dictionary: instead of checking every candidate word, you check all words that are anagrams of each other at once, since if any one of them is contained in your word, all of them are.

So start by building a dictionary whose key is a sorted string of letters and whose value is a list of the words that are anagrams of the key:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> with open(r"c:\temp\words.txt", "r") as f:
...     for line in f:
...         # skip capitalized entries (proper nouns)
...         if line[0].isupper(): continue
...         word = line.strip()
...         # anagrams share the same sorted-letter key
...         key = "".join(sorted(word.lower()))
...         d[key].append(word)
...

Now we need a function to see if a word contains a candidate. This function assumes that the word and candidate are both sorted, so that it can go through them both letter by letter and give up quickly when it finds that they don't match.

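>>> def contains(sorted_word, sorted_candidate):
...     wchars = iter(sorted_word)
...     for cc in sorted_candidate:
...         while True:
...             try:
...                 wc = next(wchars)  # next(it) works in Python 2.6+ and 3
...             except StopIteration:
...                 return False  # word exhausted before the candidate
...             if wc < cc: continue  # skip word letters the candidate lacks
...             if wc == cc: break  # matched; go to the next candidate letter
...             return False  # wc > cc: the word has no cc left
...     return True
...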

Now find all the candidate keys in the dictionary that are contained by the word, and aggregate all of their values into a single list:

>>> w = sorted("mythopoetic")
>>> result = []
>>> for k in d.keys():
...     if contains(w, k): result.extend(d[k])
...
>>> len(result)
429
>>> sorted(result)[:20]
['c', 'ce', 'cep', 'ceti', 'che', 'chetty', 'chi', 'chime', 'chip', 'chit', 'chitty', 'cho', 'chomp', 'choop', 'chop', 'chott', 'chyme', 'cipo', 'cit', 'cite']

That last step takes about a quarter second on my laptop; there are 195K keys in my dictionary (I'm using the BSD Unix words file).
