Tokenizing unsplit words from OCR using NLTK

I'm using NLTK to process some text that is extracted from PDF files. I can recover the text mostly intact, but there are lots of instances where spaces between words are not captured, so I get words like ifI instead of if I, or thatposition instead of that position, or andhe's instead of and he's.

My question is this: how can I use NLTK to look for words it does not recognize/has not learned, and see if there are "nearby" word combinations that are much more likely to occur? Is there a more graceful way to implement this kind of check than simply marching through the unrecognized word, one character at a time, splitting it, and seeing if it makes two recognizable words?
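For reference, the brute-force approach described above (marching through the unrecognized word one character at a time and checking whether both halves are known words) can be sketched in a few lines. This is a minimal illustration, not the suggested solution; it uses a small hard-coded vocabulary as a stand-in for a real word list such as NLTK's words corpus, and it only tries a single split point:

```python
def split_unknown(word, vocab):
    """Try each split point in turn; return the first split whose
    two halves are both known words, or None if no split works."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left.lower() in vocab and right.lower() in vocab:
            return left, right
    return None

# Stand-in vocabulary; in practice this would be a large word list,
# e.g. set(w.lower() for w in nltk.corpus.words.words()).
vocab = {"if", "i", "that", "position", "and"}

split_unknown("ifI", vocab)           # ('if', 'I')
split_unknown("thatposition", vocab)  # ('that', 'position')
```

One weakness of this scheme, beyond its clumsiness, is that many run-together words have several valid splits, and a bare dictionary lookup cannot rank them; that is part of why a spell-checker with a suggestion model is a better fit.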

I would suggest that you consider using pyenchant instead, since it is a more robust solution for this sort of problem. Once you have installed it (e.g. with pip install pyenchant), here is an example of how you would obtain your results:

>>> text = "IfI am inthat position, Idon't think I will."  # note the missing spaces
>>> from enchant.checker import SpellChecker
>>> checker = SpellChecker("en_US")
>>> checker.set_text(text)
>>> for error in checker:
...     for suggestion in error.suggest():
...         # accept only a suggestion that has exactly the same characters,
...         # in the same order, as the error once spaces are ignored --
...         # i.e. a suggestion that merely reinserts the missing spaces
...         if error.word.replace(' ', '') == suggestion.replace(' ', ''):
...             error.replace(suggestion)
...             break
...
>>> checker.get_text()
"If I am in that position, I don't think I will."  # the text is now fixed
