
Python Regex Search for Word with Non-alpha Characters in the Middle

I need to find the indices at which a lowercase word (letters a-z) occurs in a string. However, the string might have any number of non-alpha characters in the middle of the word.

For example, the word "dont" spans indices [0, 5) in the phrase "don't do that."

I searched around for ways to match non-alpha characters and achieved this with the following regex:

>>> import re
>>> pattern = re.compile("d[^a-z]*o[^a-z]*n[^a-z]*t[^a-z]*")
>>> test = "don't"
>>> pattern.search(test).start()
0
>>> pattern.search(test).end()
5
>>> test = "d'o&&&&&n't"
>>> pattern.search(test).start()
0
>>> pattern.search(test).end()
11
>>>

Is there a more concise way to express this regex? Or would I have to write code to insert [^a-z]* between every character in every word I want to search for?

Sorry if this question already exists - I don't know exactly how to phrase this question. Thanks for the help.

You can match any lowercase word of this form, with non-alpha characters mixed in, by repeating a non-capturing group:

(?:[a-z][^a-z]*)+
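A quick check of what this generic pattern actually matches may help. It is only a sketch: the variable names are illustrative. Because spaces are also non-alpha characters, the match is not limited to a single word and runs across the whole phrase.

```python
import re

# Generic pattern: one lowercase letter followed by any non-lowercase run, repeated.
generic = re.compile(r"(?:[a-z][^a-z]*)+")

m = generic.search("don't do that.")
print(m.span())   # (0, 14): the whole phrase, not just "don't"
print(m.group())  # don't do that.

# A string with no lowercase letters produces no match at all.
print(generic.search("DON'T"))  # None
```

So this form is useful for finding any letter run, but not for locating one specific word.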

Alternatively, you can automate this regex for every given word:

>>> word = 'dont'
>>> regex = ''.join(x + '[^a-z]*' for x in word)
>>> regex
'd[^a-z]*o[^a-z]*n[^a-z]*t[^a-z]*'
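Note that the join above also appends [^a-z]* after the last letter, so in "don't do that." the match would end at index 6 (the trailing space is consumed) rather than 5. A minimal sketch of a variant that puts the filler only between letters is shown here. The helper names build_pattern and find_span are hypothetical, and re.escape is added only as a precaution in case a word ever contains regex metacharacters.

```python
import re

FILLER = "[^a-z]*"  # any run of characters that are not lowercase a-z


def build_pattern(word):
    """Compile a regex matching `word` with optional filler between its letters."""
    # Filler goes only BETWEEN letters, so the match ends on the last letter.
    return re.compile(FILLER.join(re.escape(ch) for ch in word))


def find_span(word, text):
    """Return (start, end) of the first match of `word` in `text`, or None."""
    m = build_pattern(word).search(text)
    return m.span() if m else None


print(find_span("dont", "don't do that."))  # (0, 5)
print(find_span("dont", "d'o&&&&&n't"))     # (0, 11)
print(find_span("dont", "do it"))           # None
```

Compiling once per word and reusing the pattern is probably worthwhile if many strings are searched for the same word.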

Yes, if that is really what you intend, you will have to do it the way you showed.

A regex matches only consecutive sequences of specific characters or character classes. It cannot know that d&&o should be treated as just d and o, because the characters in between still have to be matched by something in the pattern.

Try this:

pattern = re.compile(r"[^\w']|don't")
