Statistical sentence suggestion model like spell checking

There are already spell-checking models available that help us find suggested correct spellings based on a corpus of trained correct spellings. Can the granularity be increased from letters to words so that we can get phrase suggestions as well? That is, if an incorrect phrase is entered, the model should suggest the nearest correct phrase from the corpus of correct phrases, which is of course trained from a list of valid phrases.

Are there any Python libraries that already provide this functionality, or how should I proceed with an existing large gold-standard phrase corpus to get statistically relevant suggestions?

Note: this is different from a spell checker, as the alphabet in a spell checker is finite, whereas in a phrase corrector the "alphabet" is itself a word and hence theoretically infinite, but we can limit the number of words from a phrase bank.

What you want to build is an N-gram model, which consists in computing the probability of each word following a sequence of n words.

You can use the NLTK text corpora to train your model, or you can tokenize your own corpus with nltk.sent_tokenize(text) and nltk.word_tokenize(sentence).
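For example, a minimal sketch of tokenizing a small corpus with those two functions (assuming the NLTK 'punkt' tokenizer data has been downloaded):

import nltk

corpus = "The cat is cute. He jumps and he is happy."
# Split the corpus into sentences, then each sentence into word tokens.
for sentence in nltk.sent_tokenize(corpus):
    print(nltk.word_tokenize(sentence))
# ['The', 'cat', 'is', 'cute', '.']
# ['He', 'jumps', 'and', 'he', 'is', 'happy', '.']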

You can consider a 2-gram (Markov) model:

What is the probability of "kitten" following "cute"?

...or a 3-gram model:

What is the probability of "kitten" following "the cute"?

etc.
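Concretely, with maximum-likelihood estimation such a probability is just a ratio of counts over the training corpus; for the 2-gram case:

P(kitten | cute) = count("cute kitten") / count("cute")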

Obviously, training the model with (n+1)-grams is costlier than with n-grams.

Instead of considering words, you can consider the pair (word, pos), where pos is the part-of-speech tag (you can get the tags using nltk.pos_tag(tokens)).
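As a rough sketch of that variant (assuming the NLTK 'averaged_perceptron_tagger' data is installed), the counting step would then run over (word, tag) pairs instead of bare words:

from collections import defaultdict
import nltk

pair_ngram = defaultdict(lambda: defaultdict(int))
sentence = "The cat is cute."
tokens = [token.lower() for token in nltk.word_tokenize(sentence)]
# pos_tag returns pairs such as ('the', 'DT'), ('cat', 'NN'), ...
tagged = nltk.pos_tag(tokens)
for pair, next_pair in zip(tagged, tagged[1:]):
    pair_ngram[pair][next_pair] += 1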

You can also try to consider lemmas instead of the surface words.
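A minimal sketch using NLTK's WordNet lemmatizer (assuming the 'wordnet' data is installed; by default lemmatize() treats tokens as nouns):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tokens = ["he", "jumps", "and", "he", "is", "happy"]
# 'jumps' -> 'jump'; passing pos='v' would turn verbs such as 'is' into 'be'
lemmas = [lemmatizer.lemmatize(token) for token in tokens]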

Here are some interesting lectures about N-gram modelling:

  1. Introduction to N-grams
  2. Estimating N-gram Probabilities

Here is a simple, short (and unoptimized) example of 2-gram code:

from collections import defaultdict
import nltk
import math

# Count bigram occurrences: ngram[token][next_token] = number of times next_token follows token
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
    # lowercase the tokens; build a list so it can be zipped with its own slice
    tokens = [token.lower() for token in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1

# Convert counts to base-10 log conditional probabilities: log10 P(next_token | token)
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(count) - total
                    for nxt, count in ngram[token].items()}
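
Once the table is built, a simple way to use it for suggestions is to rank the candidate next tokens for a given word by their log probability (the variable name below is just illustrative):

# Suggest the most likely tokens to follow 'is', highest log-probability first.
suggestions = sorted(ngram["is"].items(), key=lambda item: item[1], reverse=True)
print(suggestions)  # [('cute', -0.301...), ('happy', -0.301...)]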
