
Python spell corrector using nltk

I am trying to write a spell corrector in Python for a corpus of tweets I have (I am new to Python and nltk). The tweets are in XML format and are tokenised. I have tried using the enchant.checker SpellChecker, but I seem to be hitting a bug with it:

>>> text = "this is sme text with a speling mistake."
>>> from enchant.checker import SpellChecker
>>> chkr = SpellChecker("en_US", text)
>>> for err in chkr:
...     err.replace("SPAM")
... 
>>> chkr.get_text()
'this is SPAM text with a SPAMSSPSPAM.SSPSPAM'

when it should return "this is some text with a spelling mistake."

I have also written a spell corrector for single words that I am happy with, but I am struggling to work out how to run it over the tokenised tweet files:

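import enchant
from nltk.metrics.distance import edit_distance


# The class statement was missing from the post; the name is a placeholder.
class SpellingReplacer:

    def __init__(self, dict_name='en_GB', max_dist=2):
        # Use the dictionary that was requested instead of a hard-coded one.
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        # Correctly spelled words are returned unchanged.
        if self.spell_dict.check(word):
            return word

        suggestions = self.spell_dict.suggest(word)

        # Accept the top suggestion only if it is close to the original word.
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        return word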
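
One way to run a single-word corrector over tokenised tweets is to walk the XML tree and rewrite the text of each token. The following is only a minimal sketch: the real schema is not shown in the question, so the `<tweet>` and `<token>` element names, the `correct_tokens` helper, and the `toy_correct` stand-in are all assumptions made for illustration. `correct` can be any function that maps a word to its replacement.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real tweet files may use different element names.
SAMPLE = """<tweets>
  <tweet id="1"><token>this</token><token>is</token><token>sme</token><token>text</token></tweet>
  <tweet id="2"><token>a</token><token>speling</token><token>mistake</token></tweet>
</tweets>"""


def correct_tokens(root, correct, token_tag="token"):
    """Apply correct() to the text of every token element, in place."""
    for token in root.iter(token_tag):
        if token.text:
            token.text = correct(token.text)
    return root


def toy_correct(word):
    """Stand-in corrector so the sketch runs without pyenchant installed."""
    fixes = {"sme": "some", "speling": "spelling"}
    return fixes.get(word, word)


root = correct_tokens(ET.fromstring(SAMPLE), toy_correct)
corrected = [[t.text for t in tweet.iter("token")] for tweet in root.iter("tweet")]
print(corrected)
```

To use a real corrector, pass the `replace` method of your own class instead of `toy_correct`. For files on disk, `ET.parse(path)` returns a tree whose `getroot()` can be passed in, and `tree.write(out_path)` saves the result.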

Can anybody help me, please?

Thanks

I saw your post and thought I'd play around with it. This is what I got.

I added a few print statements to see what was going on:

from enchant.checker import SpellChecker

text = "this is sme text with a speling mistake."

chkr = SpellChecker("en_US", text)
for err in chkr:
    print(err.word + " at position " + str(err.wordpos))  #<----
    err.replace("SPAM")

t = chkr.get_text()
print("\n" + t)  #<----

and this is the result of running the code:

sme at position 8
speling at position 25
ing at position 29
ng at position 30
AMMstake at position 32
ake at position 37
ke at position 38
AMM at position 40

this is SPAM text with a SPAMSSPSPAM.SSPSPAM

As you can see, once the misspelled words are replaced by "SPAM", the text being checked seems to change underneath the checker: fragments of "SPAM" (such as "AMMstake") show up in err.word, and the reported positions no longer line up with word boundaries.
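
Whatever the cause inside the checker, one way to sidestep drifting offsets is to record every misspelling and its position first, and then rebuild the string in a single pass. The sketch below does not use pyenchant at all: the toy word list and the `find_errors` and `replace_all` helpers are hypothetical names that stand in for a real dictionary lookup.

```python
import re

# Toy dictionary, assumed for illustration only.
KNOWN_WORDS = {"this", "is", "some", "text", "with", "a", "spelling", "mistake"}


def find_errors(text, known=KNOWN_WORDS):
    """Return (start, end, word) for each word not in the toy dictionary."""
    return [
        (m.start(), m.end(), m.group())
        for m in re.finditer(r"[A-Za-z']+", text)
        if m.group().lower() not in known
    ]


def replace_all(text, replacement_for):
    """Rebuild text once, using offsets taken from the unmodified original."""
    pieces, cursor = [], 0
    for start, end, word in find_errors(text):
        pieces.append(text[cursor:start])      # untouched text before the error
        pieces.append(replacement_for(word))   # replacement may be any length
        cursor = end
    pieces.append(text[cursor:])               # tail after the last error
    return "".join(pieces)


text = "this is sme text with a speling mistake."
print(find_errors(text))
print(replace_all(text, lambda word: "SPAM"))
```

Because every offset refers to the original string, a replacement that is longer or shorter than the word it replaces cannot shift the position of later errors.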

I tried the original code from http://pythonhosted.org/pyenchant/api/enchant.checker.html , with the example it looks like you used for your question, and still got some unexpected results.

Note: the only things I added were the print statements.

Original:

>>> text = "This is sme text with a fw speling errors in it."
>>> chkr = SpellChecker("en_US",text)
>>> for err in chkr:
...   err.replace("SPAM")
...
>>> chkr.get_text()
'This is SPAM text with a SPAM SPAM errors in it.'

My code:

from enchant.checker import SpellChecker

text = "This is sme text with a fw speling errors in it."

chkr = SpellChecker("en_US", text)
for err in chkr:
    print(err.word + " at position " + str(err.wordpos))
    err.replace("SPAM")

t = chkr.get_text()
print("\n" + t)

The output did not match the website:

sme at position 8
fw at position 25
speling at position 30
ing at position 34
ng at position 35
AMMrors at position 37  #<---- seems to add in parts of "SPAM"

This is SPAM text with a SPAM SPAMSSPSPAM in it.  #<---- my output ???

Anyway, here's something I came up with that solves part of the problem. Instead of replacing with "SPAM", I use a version of the code you posted for single-word replacement and replace with an actual suggested word. It is important to note that the "suggested" word is wrong 100% of the time in this example. I've run across this issue in the past: how to implement spelling correction without user interaction. The scope of that would be far beyond your question, but I think you're going to need a fair amount of NLP to get accurate results.

import enchant
from enchant.checker import SpellChecker
from nltk.metrics.distance import edit_distance

class MySpellChecker():

    def __init__(self, dict_name='en_US', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        suggestions = self.spell_dict.suggest(word)

        # Return the first suggestion that is within max_dist edits of the word.
        for suggestion in suggestions:
            if edit_distance(word, suggestion) <= self.max_dist:
                return suggestion

        return word


if __name__ == '__main__':
    text = "this is sme text with a speling mistake."

    my_spell_checker = MySpellChecker(max_dist=1)
    chkr = SpellChecker("en_US", text)
    for err in chkr:
        print(err.word + " at position " + str(err.wordpos))
        err.replace(my_spell_checker.replace(err.word))

    t = chkr.get_text()
    print("\n" + t)
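
Since the first suggestion is not always the best one, one option is to rank the candidates by edit distance and take the closest. This is a minimal sketch in plain Python: `levenshtein` and `closest_suggestion` are illustrative names, and the candidate lists are made up for the example, not real Enchant output.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def closest_suggestion(word, suggestions, max_dist=2):
    """Return the nearest suggestion within max_dist, else the word itself."""
    if not suggestions:
        return word
    best = min(suggestions, key=lambda s: levenshtein(word, s))
    return best if levenshtein(word, best) <= max_dist else word


print(closest_suggestion("speling", ["spending", "spelling"]))
print(closest_suggestion("xyz", ["hello"], max_dist=1))
```

Ties are broken by list order, so a word such as "sme", which is one edit away from both "some" and "same", still needs context or word-frequency information to resolve. That is the harder problem mentioned above.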

The problem with your spellchecker is the line

err.replace("SPAM")

You want to feed the misspelled word to the function, i.e.

err.replace(err.word)

For every error word the checker points out, instead of replacing it with a fixed string, do the following:

1. Encounter the error word, e.g. "sme".
2. Ask for suggestions for this word, e.g. a = enchant.Dict("en_US").suggest("sme")
3. Assuming Enchant suggests correctly, use err.replace(a[0])

Hope it works for you. Honestly, though, Enchant and the backends it uses internally (Aspell, etc.) are not that accurate. That is why I, too, am looking for an open-source spell checker that can handle at least 85% of spelling errors accurately.
