I need to find indices at which a lowercase word with letters az occurs in a string. However, the string might have a bunch of non-alpha characters within the word.
For example, the word "dont" spans indices [0, 5) in the phrase "don't do that."
I searched around for ways to match non-alpha characters and achieved this with the following regex:
>>> import re
>>> pattern = re.compile("d[^a-z]*o[^a-z]*n[^a-z]*t[^a-z]*")
>>> test = "don't"
>>> pattern.search(test).start()
0
>>> pattern.search(test).end()
5
>>> test = "d'o&&&&&n't"
>>> pattern.search(test).start()
0
>>> pattern.search(test).end()
11
>>>
Is there a more concise way to express this regex? Or would I have to write code to insert [^az]* between every character in every word I want to search for?
Sorry if this question already exists - I don't know exactly how to phrase this question. Thanks for the help.
You can match for every lowercase word like that, using repetition under a non-capturing group:
(?:[a-z][^a-z]*)+
Alternatively, you can automate this regex for every given word:
>>> word = 'dont'
>>> regex = ''.join(x + '[^a-z]*' for x in word)
>>> regex
'd[^a-z]*o[^a-z]*n[^a-z]*t[^a-z]*'
Yes, you will have to do it the way you showed if it is really your intention.
A regex only matches consequent sequences of specific chars or types of chars. It cannot know that you need to match d&&o
with d
and o
only, since there are other chars that must be matched.
尝试这个:
pattern = re.compile("[^\w']|don't")
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.