
Is it possible to tokenize all except pre-defined words?

I want to tokenize a sentence but keep the pre-defined words intact, e.g. turning

"i went to university of abc and had a wonderful time there!"

into

["i", "went", "to", "university of abc", "and", "had", "a", "wonderful", "time", "there", "!"]

As "university of abc" being the pre-defined words.

I couldn't find such a parameter or option in any of the NLTK tokenizers. Is there any way I can hack this together? Thanks!

You could use the regular expression tokenizer and write a pattern that, say, splits on all whitespace that's not part of "university of abc". That's going to be a hassle, though. The hacky approach is probably to preprocess the text with a regex (or plain string replacement) that turns "university of abc" into "university-of-abc" or some other string that won't get broken into separate tokens, then map it back afterwards (the exact placeholder depends on which tokenizer you're using).
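A minimal sketch of that placeholder hack, assuming NLTK's `word_tokenize` is available; the helper name and placeholder scheme are illustrative, not part of any NLTK API:

```python
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the punkt models; download them once if missing.
# nltk.download('punkt')

def tokenize_keeping_phrases(text, phrases):
    # Replace each protected phrase with a single alphanumeric token
    # that the tokenizer will not split apart.
    placeholders = {}
    for i, phrase in enumerate(phrases):
        placeholder = f"PHRASEPLACEHOLDER{i}"
        placeholders[placeholder] = phrase
        text = text.replace(phrase, placeholder)

    tokens = word_tokenize(text)

    # Map placeholders back to the original phrases.
    return [placeholders.get(tok, tok) for tok in tokens]

print(tokenize_keeping_phrases(
    "i went to university of abc and had a wonderful time there!",
    ["university of abc"],
))
# ['i', 'went', 'to', 'university of abc', 'and', 'had', 'a',
#  'wonderful', 'time', 'there', '!']
```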

Rather than splitting, use matching with this regex:

(university of abc|\w+|[^\w\s]+)

RegEx Demo

You can add more pre-defined phrases on the left-hand side of the alternation, as in the pattern shown above; a sketch follows.
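A minimal sketch of this match-based approach using Python's standard `re` module; `re.findall` returns every non-overlapping match, and because alternation is tried left to right, the pre-defined phrase wins over the generic `\w+` branch:

```python
import re

# Pre-defined phrases go first in the alternation so they match as a unit.
pattern = r"(university of abc|\w+|[^\w\s]+)"
text = "i went to university of abc and had a wonderful time there!"

tokens = re.findall(pattern, text)
print(tokens)
# ['i', 'went', 'to', 'university of abc', 'and', 'had', 'a',
#  'wonderful', 'time', 'there', '!']
```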
