I am learning regular expressions using Python 2.7.
Given a sentence (assume lowercase ASCII) such as:
input = 'i like: a, b, 007 and c!!'
How would I tokenize the input string into
['i', 'like', ':', 'a', ',', 'b', ',', '007', 'and', 'c', '!!']
I can write the automaton and code the transition matrix in C++, but I would like to do this in Python.
I am unable to come up with a regex that matches these distinct classes (letters, digits and punctuation) in one go.
I have seen a couple of Stack Overflow posts here and here, but I do not quite follow their approaches.
I have been trying this for some time now and would appreciate your help.
PS: This is not a homework question
>>> import re
>>> from string import punctuation
>>> text = 'i like: a, b, 007 and c!!'
>>> re.findall(r'\w+|[{0}]+'.format(punctuation), text)
['i', 'like', ':', 'a', ',', 'b', ',', '007', 'and', 'c', '!!']
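One caveat with the pattern above: punctuation is interpolated into the character class unescaped, so the backslash it contains ends up escaping the ']' that follows it and is not itself matched. A minimal sketch of a stricter variant, assuming a helper named tokenize and a constant TOKEN_RE (both hypothetical names, not part of the original answer), passes the string through re.escape first:

```python
import re
from string import punctuation

# Hypothetical names for illustration. re.escape makes every punctuation
# character literal inside the character class, including the backslash.
TOKEN_RE = re.compile(r'\w+|[{0}]+'.format(re.escape(punctuation)))


def tokenize(text):
    """Return runs of word characters and runs of punctuation, in order."""
    return TOKEN_RE.findall(text)


tokens = tokenize('i like: a, b, 007 and c!!')
```

Note that \w also matches the underscore, so '_' is grouped with letters and digits rather than with punctuation.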
This also works, but the second alternative matches any run of non-whitespace characters whenever \w+ fails to match at that position, so it is less strict than the punctuation class:
>>> re.findall(r'\w+|\S+', text)
['i', 'like', ':', 'a', ',', 'b', ',', '007', 'and', 'c', '!!']
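The difference shows up when punctuation is immediately followed by letters: \S+ then swallows the letters too. A small sketch comparing it with [^\w\s]+, an alternative not mentioned in the original answer that matches only characters that are neither word characters nor whitespace (the sample string and variable names are made up for illustration):

```python
import re

# Made-up sample with punctuation directly before a letter.
sample = 'see (b) and c!!'

# \S+ starts at '(' and keeps going through 'b' and ')'.
loose = re.findall(r'\w+|\S+', sample)

# [^\w\s]+ stops as soon as it reaches a word character.
strict = re.findall(r'\w+|[^\w\s]+', sample)

print(loose)   # ['see', '(b)', 'and', 'c', '!!']
print(strict)  # ['see', '(', 'b', ')', 'and', 'c', '!!']
```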