Regex pattern with positive lookahead and lookbehind

I have a file that looks like this:

maar beroepsmensen
    p( maar | <s> )     =  0.005859305 [ -2.232154 ]
    p( beroepsmensen | maar ...)    =  7.865118e-06 [ -5.104295 ] # <- first match: 7.865118e-06
    p( kunnen | beroepsmensen ...)  =  6.842439e-08 [ -5.104295 ]
    p( </s> | kunnen ...)   =  0.04018713 [ -1.395913 ]
1 sentences, 2 words, 0 OOVs
0 zeroprobs, logprob= -8.732362 ppl= 814.3052 ppl1= 23237.04

dan scootermobiel
    p( dan | <s> )  =  0.005859305 [ -2.232154 ]
    p( scootermobiel | dan)     =  0.827746 [ -9.106363 ] # <- second match: 0.827746
    p( he | scootermobiel)  =  0.2520393 [ -3.123365 ]
    p( </s> | he ...)   =  0.04499642 [ -1.346822 ]
1 sentences, 2 words, 0 OOVs

and a list of words, e.g. mylst = ['beroepsmensen', 'scootermobiel'].

I want to loop through the list and, for each word, find the first number on the line that matches the pattern p( ithwordfromlist | anotherword ) = 9.999999999. (See the matches marked in the toy example above.) Note that the other word after the | can be followed by three dots, and that the number is sometimes written in scientific notation with an e- exponent.

So far, I have managed to write a regex that finds all numbers in front of the [, with an optional decimal part and an optional e- exponent, using a positive lookahead:

\\d+(\\.\\d+)?(e-\\d+)?(?=( )+\\[) #the number of spaces after the number can vary too.

However, I failed to write a positive lookbehind that matches the pattern before the number. For instance, a lookbehind like (?<=\\=( )+) produces the error A quantifier inside a lookbehind makes it non-fixed width. (Maybe a lookbehind is not the best approach, so please don't hesitate to propose other solutions too.)

Up to now, I have split the long file into a list of lines and applied the regex to every element of that list. However, I could of course also apply it to the whole text at once, if that would be faster. So if you have solutions for both approaches, please let me know and I will compare the runtimes. Thanks!

Edit: Inserted new lines that start with the structure p( word1 | word2 ) but should not be matched.

Edit2: Made the question more concrete.

How about a regex like this:

\s*p\s*\(\s*\w+\s*\|\s*\w+\s*\)\s*=\s*([\de\-\.]+)\s*\[\s*[\-\.\de]+\s*\]\s*

As seen here

All you need to do is extract group 1 from each match.

The complete code would look like this:

import re

pattern = r'\s*p\s*\(\s*\w+\s*\|\s*\w+\s*\)\s*=\s*([\de\-\.]+)\s*\[\s*[\-\.\de]+\s*\]\s*'

f = """maar beroepsmensen
    p( maar | <s> )     =  0.005859305 [ -2.232154 ]
    p( beroepsmensen | maar )    =  7.865118e-06 [ -5.104295 ] # <- first match: 7.865118e-06
    p( </s> | beroepsmensen ...)    =  0.04018713 [ -1.395913 ]
1 sentences, 2 words, 0 OOVs
0 zeroprobs, logprob= -8.732362 ppl= 814.3052 ppl1= 23237.04

dan scootermobiel
    p( dan | <s> )  =  0.005859305 [ -2.232154 ]
    p( scootermobiel | dan)     =  0.827746 [ -9.106363 ] # <- second match: 0.827746
    p( </s> | scootermobiel ...)    =  0.04499642 [ -1.346822 ]
1 sentences, 2 words, 0 OOVs"""

print(re.findall(pattern, f))

The output will be ['7.865118e-06', '0.827746']
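
Note that the sample text in the code above differs from the one in the question: the "..." after the context word was dropped on the first matching line, and the extra p( kunnen | ... ) and p( he | ... ) lines are missing. Because the pattern uses \w+ for both words, it skips lines containing "..." and does not restrict the match to the words in mylst. The following is a minimal sketch of one possible variant, not the answerer's method. It builds one pattern per word with re.escape, allows anything up to the closing parenthesis as context, and captures the number in a group instead of using a lookbehind. The helper name first_probability and the context sub-pattern [^)\n]* are assumptions made for illustration.

```python
import re

# Sample text copied from the question, including the "..." context markers
# and the lines that must NOT be matched.
text = """maar beroepsmensen
    p( maar | <s> )     =  0.005859305 [ -2.232154 ]
    p( beroepsmensen | maar ...)    =  7.865118e-06 [ -5.104295 ]
    p( kunnen | beroepsmensen ...)  =  6.842439e-08 [ -5.104295 ]
    p( </s> | kunnen ...)   =  0.04018713 [ -1.395913 ]
1 sentences, 2 words, 0 OOVs
0 zeroprobs, logprob= -8.732362 ppl= 814.3052 ppl1= 23237.04

dan scootermobiel
    p( dan | <s> )  =  0.005859305 [ -2.232154 ]
    p( scootermobiel | dan)     =  0.827746 [ -9.106363 ]
    p( he | scootermobiel)  =  0.2520393 [ -3.123365 ]
    p( </s> | he ...)   =  0.04499642 [ -1.346822 ]
1 sentences, 2 words, 0 OOVs"""

mylst = ['beroepsmensen', 'scootermobiel']


def first_probability(word, text):
    """Return the first number on a 'p( word | context ) = number [' line, or None.

    Hypothetical helper for illustration; it searches the whole text at once.
    """
    pattern = re.compile(
        r'p\(\s*' + re.escape(word) + r'\s*\|'  # "p( word |" with the word first
        r'[^)\n]*\)'                             # any context (may end in "..."), then ")"
        r'\s*=\s*'                               # "=" with variable spacing
        r'(\d+(?:\.\d+)?(?:e[-+]?\d+)?)'         # group 1: number, optional exponent
        r'(?=\s*\[)'                             # must be followed by "["
    )
    match = pattern.search(text)
    return match.group(1) if match else None


results = {word: first_probability(word, text) for word in mylst}
print(results)
# {'beroepsmensen': '7.865118e-06', 'scootermobiel': '0.827746'}
```

Since the word must appear directly after "p(", lines such as p( kunnen | beroepsmensen ...) are not matched for 'beroepsmensen'. The same function can be called on a single line instead of the whole text if the line-by-line approach is preferred; which one is faster would have to be measured on the real file.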
