How does one keep certain strings together in the following? For example,
import nltk
sentence = "?!a# .see"
tokens = nltk.word_tokenize(sentence)
tokens
gives
['!', '?', 'a', '#', '.see'] rather than keeping '?!a#' as one entity.
It seems what you want is to split the string on whitespace. If so, simply calling split would suffice:
>>> sentence.split()
['?!a#', '.see']
However, if you really want to use a tokenizer, you can use a RegexpTokenizer:
>>> from nltk.tokenize import RegexpTokenizer
>>> word_tokenizer = RegexpTokenizer(r'\S+')
>>> word_tokenizer.tokenize(sentence)
['?!a#', '.see']
Here \S matches any non-whitespace character, so \S+ captures each maximal run of non-whitespace as one token.
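If you don't need NLTK at all, the same pattern works with the standard-library re module; a minimal sketch, assuming plain whitespace-delimited input (the function name whitespace_tokenize is just an illustrative choice):

```python
import re

def whitespace_tokenize(sentence):
    # Grab each maximal run of non-whitespace characters,
    # leaving punctuation such as '?!a#' attached to its token.
    return re.findall(r'\S+', sentence)

print(whitespace_tokenize("?!a# .see"))  # ['?!a#', '.see']
```

This behaves like the RegexpTokenizer above for this pattern, and like str.split it ignores runs of consecutive whitespace.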