I have a file looking like:
maar beroepsmensen
p( maar | <s> ) = 0.005859305 [ -2.232154 ]
p( beroepsmensen | maar ...) = 7.865118e-06 [ -5.104295 ] # <- first match: 7.865118e-06
p( kunnen | beroepsmensen ...) = 6.842439e-08 [ -5.104295 ]
p( </s> | kunnen ...) = 0.04018713 [ -1.395913 ]
1 sentences, 2 words, 0 OOVs
0 zeroprobs, logprob= -8.732362 ppl= 814.3052 ppl1= 23237.04
dan scootermobiel
p( dan | <s> ) = 0.005859305 [ -2.232154 ]
p( scootermobiel | dan) = 0.827746 [ -9.106363 ] # <- second match: 0.827746
p( he | scootermobiel) = 0.2520393 [ -3.123365 ]
p( </s> | he ...) = 0.04499642 [ -1.346822 ]
1 sentences, 2 words, 0 OOVs
and a list of words, e.g. mylst = ['beroepsmensen', 'scootermobiel'].
I want to loop through the list and, for each word, find the first number on the line that matches the pattern p( ithwordfromlist | anotherword ) = 9.999999999 (see the marked matches in the toy example above). Note that the other word after the | can be followed by three dots, and that the number is sometimes written in scientific notation with an e- exponent.
So far, I have managed to write a regex that finds all numbers in front of the [, with an optional . and an optional e-, using a positive lookahead:
\\d+(\\.\\d+)?(e-\\d+)?(?=( )+\\[) #the number of spaces after the number can vary too.
However, I failed to write a positive lookbehind that matches the pattern before the number. For instance, a lookbehind like (?<=\\=( )+) raises the error "A quantifier inside a lookbehind makes it non-fixed width". (Maybe a lookbehind is not the best approach, so please don't hesitate to propose other solutions too.)
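One way around the lookbehind restriction (Python's built-in re module only supports fixed-width lookbehinds) is to drop the lookarounds and use a capturing group instead. The following is a minimal sketch, not a definitive solution; the name NUMBER_AFTER_EQUALS is made up for illustration, and the number format is assumed to be exactly the one described above (digits, optional fraction, optional e- exponent).

```python
import re

# Assumption: the number is digits, an optional fraction and an optional
# negative exponent (e.g. 7.865118e-06). It is captured as group 1.
# "=\s+" before it and "\s+\[" after it are matched but not captured,
# which takes over the job of the lookbehind and the lookahead.
NUMBER_AFTER_EQUALS = re.compile(r"=\s+(\d+(?:\.\d+)?(?:e-\d+)?)\s+\[")

line = "p( beroepsmensen | maar ...) = 7.865118e-06 [ -5.104295 ]"
match = NUMBER_AFTER_EQUALS.search(line)
value = match.group(1) if match else None
print(value)  # 7.865118e-06

# The summary line has "=" signs too, but none is followed by "number [".
summary = "0 zeroprobs, logprob= -8.732362 ppl= 814.3052 ppl1= 23237.04"
summary_match = NUMBER_AFTER_EQUALS.search(summary)
```

Because the surrounding text is consumed rather than looked around, any number of spaces is allowed on both sides of the number.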
So far, I split the long file into a list of lines and apply the regex to every element of that list. However, I could of course also apply it to the whole file contents at once, if that would be faster. If you have solutions for both approaches, please let me know and I will compare the runtimes. Thanks!
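To make the two approaches concrete, here is a rough sketch of both for a single word. The helper names (pattern_for, first_in_lines, first_in_text) are hypothetical, and the pattern assumes every relevant line starts with p( followed by the target word. With re.MULTILINE, ^ anchors at the start of each line, so the same compiled pattern can be used on one line or on the whole text.

```python
import re

# Assumed number format: digits, optional fraction, optional e- exponent.
NUMBER = r"(\d+(?:\.\d+)?(?:e-\d+)?)"


def pattern_for(word):
    # re.escape protects words that contain regex metacharacters.
    # [^)\n]* skips the context word and the optional "..." up to ")".
    return re.compile(
        r"^\s*p\(\s*" + re.escape(word) + r"\s*\|[^)\n]*\)\s*=\s*" + NUMBER + r"\s+\[",
        re.MULTILINE,
    )


def first_in_lines(lines, word):
    """Approach 1: test each line separately, stop at the first hit."""
    pattern = pattern_for(word)
    for line in lines:
        m = pattern.search(line)
        if m:
            return m.group(1)
    return None


def first_in_text(text, word):
    """Approach 2: search the whole text in one call."""
    m = pattern_for(word).search(text)
    return m.group(1) if m else None


text = """p( maar | <s> ) = 0.005859305 [ -2.232154 ]
p( beroepsmensen | maar ...) = 7.865118e-06 [ -5.104295 ]
p( kunnen | beroepsmensen ...) = 6.842439e-08 [ -5.104295 ]"""

print(first_in_lines(text.splitlines(), "beroepsmensen"))
print(first_in_text(text, "beroepsmensen"))
```

Both calls should return the same string for the same input. Which one is faster will depend on the file size and on how early the matches occur, so it is worth measuring on the real data, for example with the standard timeit module.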
Edit: Inserted new lines that start with the structure p( word1 | word2 ) and should not be matched.
Edit2: Made the question more concrete.
How about a regex like this:
\s*p\s*\(\s*\w+\s*\|\s*\w+\s*\)\s*=\s*([\de\-\.]+)\s*\[\s*[\-\.\de]+\s*\]\s*
All you need to do is extract group 1 from each match.
The complete code would look like this:
import re
pattern = r'\s*p\s*\(\s*\w+\s*\|\s*\w+\s*\)\s*=\s*([\de\-\.]+)\s*\[\s*[\-\.\de]+\s*\]\s*'
f = """maar beroepsmensen
p( maar | <s> ) = 0.005859305 [ -2.232154 ]
p( beroepsmensen | maar ) = 7.865118e-06 [ -5.104295 ] # <- first match: 7.865118e-06
p( </s> | beroepsmensen ...) = 0.04018713 [ -1.395913 ]
1 sentences, 2 words, 0 OOVs
0 zeroprobs, logprob= -8.732362 ppl= 814.3052 ppl1= 23237.04
dan scootermobiel
p( dan | <s> ) = 0.005859305 [ -2.232154 ]
p( scootermobiel | dan) = 0.827746 [ -9.106363 ] # <- second match: 0.827746
p( </s> | scootermobiel ...) = 0.04499642 [ -1.346822 ]
1 sentences, 2 words, 0 OOVs"""
print(re.findall(pattern, f))
The output will be ['7.865118e-06', '0.827746']
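Note that the pattern above requires \w+ in both positions, so it would skip lines where the second word is followed by three dots, and it would accept any word in the first position, not only the ones from mylst. A possible variant that restricts the first position to the listed words and tolerates the trailing dots might look like the sketch below. The function name first_probabilities is made up for illustration, and the number format is assumed to be the one described in the question.

```python
import re


def first_probabilities(text, words):
    """Return {word: first probability as float} for every listed word found."""
    # Alternation of the escaped target words, e.g. "beroepsmensen|scootermobiel".
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(
        r"^\s*p\(\s*(" + alternatives + r")\s*\|[^)\n]*\)\s*=\s*"
        r"(\d+(?:\.\d+)?(?:e-\d+)?)\s+\[",
        re.MULTILINE,
    )
    result = {}
    for m in pattern.finditer(text):
        # setdefault keeps only the first occurrence of each word.
        result.setdefault(m.group(1), float(m.group(2)))
    return result


data = """maar beroepsmensen
p( maar | <s> ) = 0.005859305 [ -2.232154 ]
p( beroepsmensen | maar ...) = 7.865118e-06 [ -5.104295 ]
p( kunnen | beroepsmensen ...) = 6.842439e-08 [ -5.104295 ]
p( </s> | kunnen ...) = 0.04018713 [ -1.395913 ]
1 sentences, 2 words, 0 OOVs
0 zeroprobs, logprob= -8.732362 ppl= 814.3052 ppl1= 23237.04
dan scootermobiel
p( dan | <s> ) = 0.005859305 [ -2.232154 ]
p( scootermobiel | dan) = 0.827746 [ -9.106363 ]
p( he | scootermobiel) = 0.2520393 [ -3.123365 ]
p( </s> | he ...) = 0.04499642 [ -1.346822 ]
1 sentences, 2 words, 0 OOVs"""

mylst = ["beroepsmensen", "scootermobiel"]
probs = first_probabilities(data, mylst)
print(probs)
```

This makes a single pass over the text for all words at once. Lines such as p( he | scootermobiel) are ignored because the listed word is only accepted directly after p(.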