Tokenize with Regex Tokenizer

I want to tokenize the following sentence with a regex tokenizer:

MOST INTERESTED IN NUT BUTTERS

When I define my tokenizer as

tokenizer = RegexpTokenizer(r'\w+')          

I get output as

['MOST', 'INTERESTED', 'IN', 'NUT', 'BUTTERS']

My desired output is

['MOST', 'INTERESTED', 'IN', 'NUT BUTTERS']

I want NUT BUTTERS to be one element. I don't understand what regular expression to use instead of \w+.
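For reference, here is a minimal runnable version of the setup above, assuming RegexpTokenizer is NLTK's (from nltk.tokenize):

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+')
>>> tokenizer.tokenize('MOST INTERESTED IN NUT BUTTERS')
['MOST', 'INTERESTED', 'IN', 'NUT', 'BUTTERS']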

Try split() instead.

>>> s = 'MOST INTERESTED IN NUT BUTTERS'
>>> s.split(' ', 3)  # maxsplit=3: split at most 3 times, keeping the rest intact
['MOST', 'INTERESTED', 'IN', 'NUT BUTTERS']
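Note that the maxsplit trick only works here because the multi-word phrase sits at the end of the string; everything after the last split is kept as one piece. If the phrase appeared earlier, the same call would still break it apart:

>>> 'NUT BUTTERS ARE GREAT'.split(' ', 3)
['NUT', 'BUTTERS', 'ARE', 'GREAT']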

If you want to go with a regex solution, you will have to make a list of the multi-word phrases that should be extracted as single tokens and build your regex like this:

multi-word phrase 1|multi-word phrase 2|multi-word phrase 3|...|multi-word phrase N|\w+

For your example it becomes:

NUT BUTTERS|\w+
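Putting that together, a minimal sketch (again assuming NLTK's RegexpTokenizer; the multi-word phrases must come before \w+ in the alternation so they are matched first):

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'NUT BUTTERS|\w+')
>>> tokenizer.tokenize('MOST INTERESTED IN NUT BUTTERS')
['MOST', 'INTERESTED', 'IN', 'NUT BUTTERS']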
