
Multiple Positive Lookbehind Regex

I'm practicing regex and ran into this input:

STATE :   TEXAS

I'm going for a positive lookbehind.

This is my regex:

state = re.search(r"(?<=STATE)\s+(?<=:)\s+\w+",str(Text),re.I|re.M)

This regex fails to capture TEXAS.

However, if I do this:

state = re.search(r"(?<=STATE)\s+:\s+\w+",str(Text),re.I|re.M)

Removing the second positive lookbehind gives me ": TEXAS" (colon and whitespace included).

All I want to extract is TEXAS, without the colon. Why does the second lookbehind fail to capture TEXAS, and how can it be fixed?

Think about this part of your pattern:

(?<=STATE)\s+(?<=:)

The first lookbehind says to find a position with "STATE" right before it. The \s+ then matches some whitespace. The second lookbehind says to look at the text immediately before the current position and find a colon. That's impossible: \s+ has just matched, so the character right before the current position is whitespace, never a colon. A lookbehind is zero-width; it can't reach back past the whitespace, and it doesn't consume the colon for you.

A lookbehind in the middle of your expression doesn't mean "skip ahead until you get past this part". It means: check whether the text ending at the current position matches the lookbehind expression. In the middle of a pattern, that text is whatever the preceding part of the pattern has just consumed. At the beginning of a regex, it is the text before the match, so there it controls where the match begins.
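To make that concrete, here is a small sketch comparing the pattern from the question with one where the colon lookbehind sits at the very start. The variable names are just for illustration, and the sample string is the one from the question.

```python
import re

text = "STATE :   TEXAS"

# Pattern from the question: once \s+ has matched, the character just
# before the current position is whitespace, so (?<=:) cannot succeed.
failing = re.search(r"(?<=STATE)\s+(?<=:)\s+\w+", text, re.I)
print(failing)  # None

# With the lookbehind at the start, it only decides where the match
# begins: right after the colon. The colon itself is not in the match.
anchored = re.search(r"(?<=:)\s+\w+", text)
print(repr(anchored.group()))  # '   TEXAS'
```

Note that the second match still carries the leading whitespace, so the lookbehind alone doesn't quite give a clean "TEXAS".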

If you just want to get "TEXAS", you should capture it in a group and then extract the group after doing the match:

>>> import re
>>> data = "STATE :   TEXAS"
>>> re.search(r"STATE\s+:\s+(\w+)", data).group(1)
'TEXAS'
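If the input might not contain the field at all, re.search returns None and calling .group(1) on it raises AttributeError. A minimal sketch of a wrapper that handles this, assuming a hypothetical helper name extract_state and a named group (neither is from the question):

```python
import re

# Hypothetical helper, not from the original post. \s* (rather than \s+)
# is an assumption so that "STATE:TEXAS" is accepted as well.
STATE_RE = re.compile(r"STATE\s*:\s*(?P<state>\w+)", re.IGNORECASE)

def extract_state(text):
    """Return the word after 'STATE :' or None if the field is absent."""
    match = STATE_RE.search(text)
    return match.group("state") if match else None

print(extract_state("STATE :   TEXAS"))  # TEXAS
print(extract_state("no such field"))    # None
```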

Don't use lookahead/lookbehind; use groups instead. (I really wish someone had told me this when I first learnt regex!)

re.search(r'STATE\s+:\s+(\w+)', "STATE :   TEXAS").group(1)
Out[145]: 'TEXAS'
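One more reason groups are the easier route here: the standard re module only accepts fixed-width lookbehinds, so folding the whole "STATE : " prefix into a single lookbehind is rejected at compile time. A quick sketch (variable names are illustrative):

```python
import re

# A variable-width lookbehind (\s+ can be any length) is rejected by re.
try:
    re.compile(r"(?<=STATE\s+:\s+)\w+")
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)

# A fixed-width lookbehind compiles, but it only works when the spacing
# is exactly a colon followed by three spaces, which is brittle.
fixed = re.search(r"(?<=:   )\w+", "STATE :   TEXAS")
print(fixed.group())  # TEXAS
```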

