
Is it possible to tokenize all except pre-defined words?

I want to tokenize a sentence but keep the pre-defined words intact, e.g. turning

"i went to university of abc and had a wonderful time there!"

into

["i", "went", "to", "university of abc", "and", "had", "a", "wonderful", "time", "there", "!"]

As "university of abc" being the pre-defined words.

I couldn't find such a parameter or option in any of the NLTK tokenizers. Is there any way I can hack this together? Thanks!

You could use the regular expression tokenizer and write a pattern that, say, splits on all whitespace that's not part of "university of abc". That's going to be a hassle, though. The hacky approach is probably to preprocess the text with a regex (or plain string replacement) that turns "university of abc" into "university-of-abc" or some other string that won't get broken into separate tokens, then map it back afterwards (the exact placeholder depends on which tokenizer you're using).
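A minimal sketch of that placeholder hack, assuming NLTK's `word_tokenize` is available; the helper name and placeholder scheme are illustrative, not part of any NLTK API:

```python
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the punkt models; download them once if missing.
# nltk.download('punkt')

def tokenize_keeping_phrases(text, phrases):
    # Replace each protected phrase with a single alphanumeric token
    # that the tokenizer will not split apart.
    placeholders = {}
    for i, phrase in enumerate(phrases):
        placeholder = f"PHRASEPLACEHOLDER{i}"
        placeholders[placeholder] = phrase
        text = text.replace(phrase, placeholder)

    tokens = word_tokenize(text)

    # Map placeholders back to the original phrases.
    return [placeholders.get(tok, tok) for tok in tokens]

print(tokenize_keeping_phrases(
    "i went to university of abc and had a wonderful time there!",
    ["university of abc"],
))
# ['i', 'went', 'to', 'university of abc', 'and', 'had', 'a',
#  'wonderful', 'time', 'there', '!']
```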

Rather than splitting, use matching with this regex:

(university of abc|\w+|[^\w\s]+)

RegEx Demo

You can add more pre-defined phrases on the left-hand side of the alternation, as in the pattern shown above; a sketch follows.
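A minimal sketch of this match-based approach using Python's standard `re` module; `re.findall` returns every non-overlapping match, and because alternation is tried left to right, the pre-defined phrase wins over the generic `\w+` branch:

```python
import re

# Pre-defined phrases go first in the alternation so they match as a unit.
pattern = r"(university of abc|\w+|[^\w\s]+)"
text = "i went to university of abc and had a wonderful time there!"

tokens = re.findall(pattern, text)
print(tokens)
# ['i', 'went', 'to', 'university of abc', 'and', 'had', 'a',
#  'wonderful', 'time', 'there', '!']
```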
