
Python NLTK tokenize sentence with wrong syntax from human errors

I am looking for a way to handle a sentence tokenization task properly.

I have this text, extracted from a human-written restaurant review:

Nevertheless, the soup enhances the prawns well.In contrast, the fish offered is fresh and well prepared.

Note that the period marking the boundary of the first sentence is not followed by a space. This is the result of a human error in writing. Many sentences were written like this, so I can't ignore this case.

So far I have tried the NLTK sentence tokenizer in Python, but it does not work as expected.

>>> import nltk.data
>>> text = "Nevertheless, the soup enhances the prawns well.In contrast, the fish offered is fresh and well prepared."
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sentences = tokenizer.tokenize(text)
>>> sentences
['Nevertheless, the soup enhances the prawns well.In contrast, the fish offered is fresh and well prepared.']

My expectation is that it should split the text into two sentences:

['Nevertheless, the soup enhances the prawns well.', 'In contrast, the fish offered is fresh and well prepared.']

Any help is appreciated in advance.

I decided to use a regex to preprocess the text. The regex I used was:

re.sub(r'(\w{2})([.!?]+)(\w+)', r'\1\2 \3', text)

It has three groups. Group 1, `(\w{2})`, matches the two word characters before the punctuation (requiring two characters helps avoid matching abbreviations). Group 2, `([.!?]+)`, matches the punctuation, which can repeat more than once. Group 3, `(\w+)`, matches the word after the punctuation, which can be as short as one character, like "I". The substitution `\1\2 \3` reinserts the three groups with a space after the punctuation.
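Putting the pieces together, here is a minimal sketch of the preprocessing step using only the standard-library `re` module (the sample `text` is taken from the question above; the NLTK tokenizer can then be run on the repaired string):

```python
import re

# Sample text from the question: the period after "well" is glued
# to the next word "In" by a human typing error.
text = ("Nevertheless, the soup enhances the prawns well."
        "In contrast, the fish offered is fresh and well prepared.")

# Insert a space after sentence-ending punctuation that is directly
# followed by the next word. Requiring two word characters before the
# punctuation makes matches on abbreviations like "e.g." less likely.
fixed = re.sub(r'(\w{2})([.!?]+)(\w+)', r'\1\2 \3', text)

print(fixed)
# → Nevertheless, the soup enhances the prawns well. In contrast,
#   the fish offered is fresh and well prepared.
```

With the space restored, `tokenizer.tokenize(fixed)` splits the text into the two expected sentences.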
