I have a requirement to tokenize the words in a sentence based on a specific word list.
wordlist = ["nlp - nltk", "CIFA R12 - INV"]
Example input: This is sample text for nlp - nltk CIFA R12 - INV

When using word_tokenize(example_input), I need nlp - nltk as one token and CIFA R12 - INV as another token. Is that possible, rather than getting nlp, -, CIFA, etc. as separate tokens?
For those who come here in the future:
After some reading, I found that the nltk.tokenize.mwe module is the way to achieve the requirement above.
Reference: http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.mwe
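A minimal sketch of how this could look with NLTK's MWETokenizer. Note that the tokenizer expects each multi-word expression as a tuple of its individual word-level tokens, so the strings from the original wordlist are split into tuples here; the separator=" " argument makes the merged token read like the original phrase:

```python
from nltk.tokenize import MWETokenizer

# Each multi-word expression is given as a tuple of word-level tokens.
# These tuples correspond to the entries "nlp - nltk" and "CIFA R12 - INV"
# from the wordlist in the question.
tokenizer = MWETokenizer(
    [("nlp", "-", "nltk"), ("CIFA", "R12", "-", "INV")],
    separator=" ",  # join merged tokens with a space instead of the default "_"
)

text = "This is sample text for nlp - nltk CIFA R12 - INV"

# MWETokenizer works on an already-tokenized list; str.split() is used here
# for simplicity (word_tokenize would also work if the punkt data is installed).
tokens = tokenizer.tokenize(text.split())
print(tokens)
# ['This', 'is', 'sample', 'text', 'for', 'nlp - nltk', 'CIFA R12 - INV']
```

The key point is that MWETokenizer runs as a second pass over an existing token list, merging any consecutive tokens that match a registered expression.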