
Issues with sentence detection using nltk

I am having trouble getting nltk to treat the following as one sentence; it splits at the exclamation mark inside the quotation marks.

s = "Donc ce n'est pas non plus de vous dire « Allez absolument ici ! », non."

I tried:

from nltk.tokenize import sent_tokenize
sent_tokenize(s, language='french')

but I get:

["Donc ce n'est pas non plus de vous dire « Allez absolument ici !", '», non.']

I am wondering if there is a better sentence detection method out there?

As someone commented below, you need the tokeniser to handle other delimiters. Unfortunately, your example contains a '.', which will trigger a split regardless of whether you find a better tokeniser.
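One possible workaround, if you keep the tokeniser you have, is to rejoin the fragments afterwards whenever a « has been opened but not yet closed. The following is only a minimal sketch: `merge_open_quotes` is a hypothetical helper name, the fragment list is typed by hand to mimic a split right after the `!`, and it assumes fragments were separated by a single space and that quotations always use « ».

```python
def merge_open_quotes(fragments):
    """Rejoin fragments until every « has a matching »."""
    merged = []
    buffer = ""
    for frag in fragments:
        # Assumption: fragments were originally separated by one space.
        buffer = f"{buffer} {frag}" if buffer else frag
        # Once the quotes are balanced, the sentence is treated as complete.
        if buffer.count("«") <= buffer.count("»"):
            merged.append(buffer)
            buffer = ""
    if buffer:
        # An unclosed « at the very end: keep whatever is left.
        merged.append(buffer)
    return merged

# Hand-written fragments imitating a split after the '!'.
fragments = [
    "Donc ce n'est pas non plus de vous dire « Allez absolument ici !",
    "», non.",
]
print(merge_open_quotes(fragments))
```

With these two fragments the helper returns a single string again. It does nothing about other quote styles or nested quotations.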

I have added another method that helps with multiple delimiters.

s = "Donc ce n'est pas non plus de vous dire « Allez absolument ici ! », non. hi there this is another sentence"

ss = s.split('.')
ss

["Donc ce n'est pas non plus de vous dire « Allez absolument ici ! », non",
 ' hi there this is another sentence']

Or you can use re.split for multiple delimiters:

import re
ss = re.split('[!.]', s)
ss
["Donc ce n'est pas non plus de vous dire « Allez absolument ici ",
 ' », non',
 ' hi there this is another sentence']
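Both splits above drop the delimiter, and the `re.split` version still breaks inside the quotation. If that is a problem, a rough alternative is to scan the string yourself and ignore punctuation that sits between « and ». This is a sketch under assumptions, not a general sentence detector: `split_outside_guillemets` is a made-up name, and it assumes a sentence ends at `.`, `!` or `?` followed by whitespace (or the end of the text), so abbreviations such as "M. Dupont" would still be split.

```python
def split_outside_guillemets(text):
    """Split after '.', '!' or '?', ignoring marks inside « »."""
    sentences = []
    depth = 0   # greater than 0 while inside a « ... » quotation
    start = 0   # index where the current sentence begins
    for i, ch in enumerate(text):
        if ch == "«":
            depth += 1
        elif ch == "»":
            depth = max(depth - 1, 0)
        elif ch in ".!?" and depth == 0:
            # Only treat the mark as a boundary at end of text or before whitespace.
            if i + 1 == len(text) or text[i + 1].isspace():
                sentences.append(text[start:i + 1].strip())
                start = i + 1
    tail = text[start:].strip()
    if tail:
        # Trailing text without final punctuation is kept as its own sentence.
        sentences.append(tail)
    return sentences

s = ("Donc ce n'est pas non plus de vous dire « Allez absolument ici ! », non. "
     "hi there this is another sentence")
print(split_outside_guillemets(s))
```

For the example string this yields two items: the full French sentence, including its closing period, and the trailing English one.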
