I want to tokenise a sentence but keep pre-defined multi-word phrases intact, e.g. turn
"i went to university of abc and had a wonderful time there!"
into
["i", "went", "to", "university of abc", "and", "had", "a", "wonderful", "time", "there", "!"]
Here "university of abc" is the pre-defined phrase.
I couldn't find such a parameter or option in any of the NLTK tokenisers. Is there a way I can hack this together? Thanks!
You could use NLTK's regular-expression tokenizer (RegexpTokenizer) and write a regex that, say, splits on all whitespace that isn't part of "university of abc".
That's going to be a hassle, though. The hacky approach is probably to either make a pass over the text yourself or write a regex that replaces "university of abc"
with "university-of-abc"
or some other string that won't get broken into separate tokens (depending on which tokenizer you're using).
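A minimal sketch of the replace-then-tokenize hack, using only the standard library rather than NLTK. The names PHRASES and tokenize_with_placeholders are hypothetical, and the simple \w+ tokenizer is a stand-in for whichever tokenizer you actually use. Underscores are used as the joining character here (instead of hyphens) because \w+ keeps them inside one token; pick whatever your tokenizer leaves alone.

```python
import re

# Hypothetical list of phrases that must survive tokenization intact.
PHRASES = ["university of abc"]

def tokenize_with_placeholders(text, phrases=PHRASES):
    restore = {}
    for phrase in phrases:
        # Join the words of the phrase so the tokenizer sees a single word.
        placeholder = phrase.replace(" ", "_")
        restore[placeholder] = phrase
        text = text.replace(phrase, placeholder)
    # Stand-in tokenizer: runs of word characters, or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    # Map placeholders back to the original phrases; leave other tokens alone.
    return [restore.get(tok, tok) for tok in tokens]

sentence = "i went to university of abc and had a wonderful time there!"
tokens = tokenize_with_placeholders(sentence)
print(tokens)
```

Note that str.replace is case-sensitive and also matches inside longer words, so real text may need lowercasing or word-boundary checks first.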
Rather than splitting, match the tokens using this regex:
(university of abc|\w+|[^\w\s]+)
You can add more pre-defined phrases as alternatives at the left-hand side of the regex, ahead of \w+, like the one shown above.
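As a rough illustration of the matching approach with Python's re module: the helper build_pattern and the list KEEP_TOGETHER are hypothetical names, and the pattern is assembled from the phrase list so that the phrases are tried before the generic \w+ alternative.

```python
import re

# Hypothetical list of phrases to keep as single tokens.
KEEP_TOGETHER = ["university of abc"]

def build_pattern(phrases):
    # Longest phrases first, so a longer phrase wins over a shorter
    # phrase that is a prefix of it.
    escaped = [re.escape(p) for p in sorted(phrases, key=len, reverse=True)]
    # Phrases come before \w+ because alternation tries options left to right.
    return re.compile("|".join(escaped + [r"\w+", r"[^\w\s]+"]))

pattern = build_pattern(KEEP_TOGETHER)
text = "i went to university of abc and had a wonderful time there!"
matched = pattern.findall(text)
print(matched)
```

This sketch does not add word boundaries, so a phrase can also match at the start of a longer word; wrapping each phrase in \b may help if that matters for your data.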