I have a requirement to tokenize the words in a sentence based on a specific word list.
wordlist = ["nlp - nltk", "CIFA R12 - INV"]
Example input: This is sample text for nlp - nltk CIFA R12 - INV

When using word_tokenize(example_input), I need nlp - nltk as one token and CIFA R12 - INV as another token. Is that possible, rather than getting nlp, -, CIFA, etc. as separate tokens?
For those who come here in the future:
After some reading, I found that the nltk.tokenize.mwe module is the way to achieve the requirement above.
Reference: http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.mwe
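A minimal sketch of how this could look with NLTK's MWETokenizer. Note that the tokenizer expects each multi-word expression as a tuple of its individual word-level tokens, so the strings from the original wordlist are split into tuples here; the separator=" " argument makes the merged token read like the original phrase:

```python
from nltk.tokenize import MWETokenizer

# Each multi-word expression is given as a tuple of word-level tokens.
# These tuples correspond to the entries "nlp - nltk" and "CIFA R12 - INV"
# from the wordlist in the question.
tokenizer = MWETokenizer(
    [("nlp", "-", "nltk"), ("CIFA", "R12", "-", "INV")],
    separator=" ",  # join merged tokens with a space instead of the default "_"
)

text = "This is sample text for nlp - nltk CIFA R12 - INV"

# MWETokenizer works on an already-tokenized list; str.split() is used here
# for simplicity (word_tokenize would also work if the punkt data is installed).
tokens = tokenizer.tokenize(text.split())
print(tokens)
# ['This', 'is', 'sample', 'text', 'for', 'nlp - nltk', 'CIFA R12 - INV']
```

The key point is that MWETokenizer runs as a second pass over an existing token list, merging any consecutive tokens that match a registered expression.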