
Regex look-ahead with non-capturing group not working as intended

Below I have text from which I want to extract the month (July in this case). word_pattern makes sure that the text contains certain words, while month_pattern extracts the month. So I first verify that the text passage contains those words and, if it does, I attempt to extract the month.

When the patterns are used separately, they get a match, but if I try to combine them I end up with no matches. I can't figure out what I'm doing wrong.

import re

text = ''' The number of shares of the
registrant’s common stock outstanding as
of July 31, 2017 was 52,833,429.'''

# patterns
word_pattern = r'(?=.*outstanding[.,]?)(?=.*common)(?=.*shares)'

month_pattern = r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)'


pattern = word_pattern + month_pattern

print(re.search(pattern, text, flags=re.IGNORECASE | re.DOTALL))

Expected result:

<re.Match object; span=(73, 77), match='July'>

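Regex patterns cannot simply be concatenated like that. Your word pattern consists only of lookaheads, which are zero-width: they check what follows the current position but do not consume any characters, so the position never moves ahead. The month therefore has to match at the very position where all three lookaheads succeed, and by the time the engine reaches the month mid-string, the required words are already behind it. You need to let the cursor advance to the month with a quantifier that bridges the gap, e.g. .* Try: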

(?=.*outstanding[.,]?)(?=.*common)(?=.*shares).*(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)
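The zero-width behaviour can be checked directly. This is only a small sketch using the text and word pattern from the question; the names word_re, july_pos, at_start and at_month are introduced here for illustration.

```python
import re

text = ''' The number of shares of the
registrant’s common stock outstanding as
of July 31, 2017 was 52,833,429.'''

flags = re.IGNORECASE | re.DOTALL
word_re = re.compile(r'(?=.*outstanding[.,]?)(?=.*common)(?=.*shares)', flags)

# Index at which the month starts in the text.
july_pos = text.index('July')

# At position 0 all three words still lie ahead, so the lookaheads succeed.
# The match is empty because lookaheads consume nothing.
at_start = word_re.match(text, 0)
print(at_start.span())   # (0, 0)

# At the month, "outstanding", "common" and "shares" are already behind
# the cursor, so the same lookaheads fail.
at_month = word_re.match(text, july_pos)
print(at_month)          # None
```

There is no single position where both the lookaheads and the month pattern can match, which is why the plain concatenation finds nothing.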


Or, with the variables from the question, pattern = word_pattern + '.*' + month_pattern should do the trick.

The overall match now starts where the lookaheads succeeded and runs to the end of the month, so the month itself is found in capture group 1: re.search(...).group(1)
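Putting it together, here is a minimal sketch that reuses the text and patterns from the question; the names flags, broken and fixed are introduced here for illustration.

```python
import re

text = ''' The number of shares of the
registrant’s common stock outstanding as
of July 31, 2017 was 52,833,429.'''

word_pattern = r'(?=.*outstanding[.,]?)(?=.*common)(?=.*shares)'
month_pattern = (r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?'
                 r'|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)')
flags = re.IGNORECASE | re.DOTALL

# Plain concatenation: the month would have to start at a position from
# which all three words are still ahead, so nothing matches.
broken = re.search(word_pattern + month_pattern, text, flags=flags)
print(broken)            # None

# Bridging the gap with .* lets the engine consume text up to the month.
fixed = re.search(word_pattern + '.*' + month_pattern, text, flags=flags)
print(fixed.span())      # (0, 77): the match starts where the lookaheads succeeded
print(fixed.group(1))    # July
print(fixed.span(1))     # (73, 77): the span expected in the question
```

Note that .* is greedy, so if the text contained several months this would capture the last one; .*? should capture the first instead.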
