I want to tokenize the following sentence with NLTK's RegexpTokenizer:
MOST INTERESTED IN NUT BUTTERS
When I define my tokenizer as
tokenizer = RegexpTokenizer(r'\w+')
I get this output:
['MOST', 'INTERESTED', 'IN', 'NUT', 'BUTTERS']
My desired output is
['MOST', 'INTERESTED', 'IN', 'NUT BUTTERS']
I want NUT BUTTERS to be a single element. I don't understand what regular expression to use instead of \w+.
Try split() instead:
>>> s = 'MOST INTERESTED IN NUT BUTTERS'
>>> s.split(' ', 3)  # maxsplit=3 caps the number of splits, so the tail stays intact
['MOST', 'INTERESTED', 'IN', 'NUT BUTTERS']
If you want a regex solution, you will have to make a list of the multi-word phrases that should be extracted as single tokens, and build your regex like this:
word space1|word space2|word space3|...|word spaceN|\w+
Note that alternation is tried left to right, so the phrases must come before the generic \w+ fallback. For your example it becomes:
NUT BUTTERS|\w+
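As a minimal sketch of this approach using only the standard library's re module (the phrase list here is assumed; extend it with your own multi-word terms), the same pattern can also be passed to RegexpTokenizer, which matches with re.findall semantics by default:

```python
import re

# Multi-word phrases that must stay together as single tokens
# (assumed example list; add your own entries).
phrases = ['NUT BUTTERS']

# Alternation is tried left to right, so the phrases must appear
# before the generic \w+ fallback.
pattern = '|'.join(re.escape(p) for p in phrases) + r'|\w+'

text = 'MOST INTERESTED IN NUT BUTTERS'
print(re.findall(pattern, text))
# ['MOST', 'INTERESTED', 'IN', 'NUT BUTTERS']
```

With NLTK installed, `RegexpTokenizer(pattern).tokenize(text)` produces the same result.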