
Spell checker speed up

I have this spell checker which I have written:

import operator

class Corrector(object):

    def __init__(self, possibilities):
        self.possibilities = possibilities

    def similar(self, w1, w2):
        # Truncate both words to their common length, then count matching positions.
        w1 = w1[:len(w2)]
        w2 = w2[:len(w1)]
        return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))

    def correct(self, w):
        corrections = {}
        for c in self.possibilities:
            probability = self.similar(w, c) * self.possibilities[c] / sum(self.possibilities.values())
            corrections[c] = probability
        return max(corrections.items(), key=operator.itemgetter(1))[0]

Here, possibilities is a dictionary like:

{word1: value1}, where the value is the number of times the word appeared in the corpus.

The similar function returns the probability of similarity between the words w1 and w2.

In the correct function, you can see that the code loops through all possible outcomes and then computes, for each of them, the probability of it being the correct spelling of w.

Can I speed up my code by somehow removing the loop?

Now, I know there might be no answer to this question; if it can't be done, just tell me that it can't!

Here you go....

from operator import itemgetter
from difflib import SequenceMatcher

class Corrector(object):

    def __init__(self, possibilities):
        self.possibilities = possibilities
        self.sums = sum(self.possibilities.values())  # compute the total once

    def correct(self, word):
        corrections = {}
        # SequenceMatcher is optimized for comparing many seq1 values against one
        # fixed seq2, so keep `word` as seq2 and swap in each candidate as seq1.
        sm = SequenceMatcher(None, '', word)
        for w, t in self.possibilities.items():
            # set_seq1 invalidates the matcher's cached state; assigning to the
            # attribute directly (sm.a = w) would leave stale results behind.
            sm.set_seq1(w)
            corrections[w] = sm.ratio() * t / self.sums
        return max(corrections.items(), key=itemgetter(1))[0]
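As a side note, the dict-plus-max pattern above can be collapsed into a single max() call with a key function, so no intermediate corrections dict is built. A minimal sketch, with the same scoring as the class above:

```python
from difflib import SequenceMatcher

def correct(word, possibilities):
    """Return the candidate with the highest similarity-weighted frequency."""
    total = sum(possibilities.values())   # compute the normalizer once
    sm = SequenceMatcher(None, '', word)  # keep `word` fixed as seq2

    def score(candidate):
        sm.set_seq1(candidate)            # resets the matcher's cached state
        return sm.ratio() * possibilities[candidate] / total

    # max over a dict iterates its keys; `score` ranks each candidate
    return max(possibilities, key=score)
```

This doesn't change the asymptotic cost (every candidate is still scored), but it avoids building and then re-scanning a dictionary.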

You could also simply cache the results of correct, so that on the next call with the same word you would know the answer without doing any computation.
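A minimal sketch of that caching idea using functools.lru_cache; the class name is illustrative, and the scoring mirrors the answer above:

```python
from difflib import SequenceMatcher
from functools import lru_cache

class CachedCorrector:
    def __init__(self, possibilities):
        self.possibilities = possibilities
        self.total = sum(possibilities.values())

    @lru_cache(maxsize=None)            # repeat queries for the same word are free
    def correct(self, word):
        best, best_score = None, -1.0
        sm = SequenceMatcher(None, '', word)
        for w, count in self.possibilities.items():
            sm.set_seq1(w)              # reset cached state for the new candidate
            score = sm.ratio() * count / self.total
            if score > best_score:
                best, best_score = w, score
        return best
```

Note that lru_cache on an instance method keeps a reference to self for as long as the entry lives in the cache, which is fine for a long-lived corrector object.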

You typically don't want to check the submitted token against all the tokens in your corpus. The "classic" way to reduce the necessary computations (and thus the calls in your for loop) is to maintain an index of all the (tri-)grams present in your document collection. Basically, you maintain a list of all the tokens of your collection on one side and, on the other side, a hash table whose keys are the grams and whose values are the indices of the tokens in the list. This can be made persistent with a DBM-like database.

Then, when it comes to checking the spelling of a word, you split it into grams, search for all the tokens in your collection that contain the same grams, sort them by gram similarity with the submitted token, and then perform your distance computations.
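A minimal sketch of such a trigram index, assuming `$`-padded character grams (the function names and padding scheme are illustrative, not a standard API):

```python
from collections import defaultdict

def trigrams(word):
    padded = "$" + word + "$"       # pad so prefixes/suffixes form grams too
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def build_index(vocabulary):
    index = defaultdict(set)        # gram -> set of words containing it
    for w in vocabulary:
        for g in trigrams(w):
            index[g].add(w)
    return index

def candidates(word, index):
    counts = defaultdict(int)       # candidate word -> number of shared grams
    for g in trigrams(word):
        for w in index.get(g, ()):
            counts[w] += 1
    # most gram-similar candidates first; only these get the expensive scoring
    return sorted(counts, key=counts.get, reverse=True)
```

Only the words returned by candidates need the full similarity computation, which is usually a tiny fraction of the vocabulary.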

Also, some parts of your code could be simplified. For example, this:

def similar(self,w1,w2):
    w1 = w1[:len(w2)]
    w2 = w2[:len(w1)]
    return sum([1 if i==j else 0 for i,j in zip(w1,w2)])/float(len(w1))

can be reduced to:

def similar(self, w1, w2, lenw1):
    return sum(i == j for i, j in zip(w1,w2)) / lenw1

where lenw1 is the pre-computed length of "w1".
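To make the simplification concrete, here is a self-contained variant that computes the common length itself (so no extra lenw1 argument is needed); booleans sum as 0/1, which is what replaces the `1 if i==j else 0` expression:

```python
def similar(w1, w2):
    # zip stops at the shorter word, matching the original truncation behavior
    n = min(len(w1), len(w2))
    return sum(a == b for a, b in zip(w1, w2)) / n
```

For example, "speed" vs "spend" match in 4 of 5 positions, giving 0.8.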
