
Python NLTK tokenize sentence with wrong syntax from human errors

I am looking for a way to handle a sentence tokenization task properly.

I have this text, extracted from a human-written restaurant review:

Nevertheless, the soup enhances the prawns well.In contrast, the fish offered is fresh and well prepared.

Note that the period marking the boundary of the first sentence is not followed by a space. This is the result of a human error in writing. Many sentences were written like this, so I can't ignore this case.

So far I have tried the NLTK sentence tokenizer in Python, but it does not work as expected.

>>> import nltk.data
>>> text = "Nevertheless, the soup enhances the prawns well.In contrast, the fish offered is fresh and well prepared."
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sentences = tokenizer.tokenize(text)
>>> sentences
['Nevertheless, the soup enhances the prawns well.In contrast, the fish offered is fresh and well prepared.']

My expectation is that it should split the text into two sentences:

['Nevertheless, the soup enhances the prawns well.', 'In contrast, the fish offered is fresh and well prepared.']

Any help is appreciated in advance.

I decided to use a regex to preprocess the text. The regex I used was:

re.sub(r'(\w{2})([.!?]+)(\w+)', r'\1\2 \3', text)

It has three groups. Group 1, `(\w{2})`, matches the two word characters before the punctuation (requiring two characters helps avoid matching abbreviations). Group 2, `([.!?]+)`, matches the punctuation, which can repeat more than once. Group 3, `(\w+)`, matches the word after the punctuation, which can be as short as one character, like "I". The substitution `\1\2 \3` reinserts the three groups with a space after the punctuation.
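Putting the pieces together, here is a minimal sketch of the preprocessing step using only the standard-library `re` module (the sample `text` is taken from the question above; the NLTK tokenizer can then be run on the repaired string):

```python
import re

# Sample text from the question: the period after "well" is glued
# to the next word "In" by a human typing error.
text = ("Nevertheless, the soup enhances the prawns well."
        "In contrast, the fish offered is fresh and well prepared.")

# Insert a space after sentence-ending punctuation that is directly
# followed by the next word. Requiring two word characters before the
# punctuation makes matches on abbreviations like "e.g." less likely.
fixed = re.sub(r'(\w{2})([.!?]+)(\w+)', r'\1\2 \3', text)

print(fixed)
# → Nevertheless, the soup enhances the prawns well. In contrast,
#   the fish offered is fresh and well prepared.
```

With the space restored, `tokenizer.tokenize(fixed)` splits the text into the two expected sentences.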
