
Negative lookahead - exclude entire match if words are found?

I am trying to parse text journals, and I am only interested in specific sections of text. I thought that I was doing fine until I noticed I was inadvertently matching sections that I did not want.

Suppose that I want to match the following section.

Section 7 - Delivering Terminal Diagnosis's

which may also show up as

Section 7. Delivering a Terminal Diagnosis

But I don't want to match anything if the words see or under precede my string, as in the examples below.

see Section 7. Delivering a Terminal Diagnosis

or

filed under Section 7. Delivering a Terminal Diagnosis

should not match anything.

I tried using a negative look-ahead, but it only excludes the words; it doesn't throw out the entire match.

((?!see )Section[\s\\n]+7[\s+]+?[-:\\n\.]+?[\s+]+?(Delivering|Deliver)(.*terminal[\s+]+Diagnosis('s)?)?[\.]?)

I don't think that I am grasping the look-around concept properly. Help?

A negative look-ahead does what it says: it specifies something that cannot match after the current position. But you don't have anything before it, so (?!see ) is tested at the spot where Section begins, and there it always succeeds.
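To see this concretely, here is a minimal sketch in Python. The shortened pattern Section\s+7 stands in for the full expression, and the variable names are only illustrative.

```python
import re

text = "see Section 7. Delivering a Terminal Diagnosis"

# The lookahead is checked at the position where the match would start.
# At index 0 the text is "see ", so the engine simply moves on.
# At index 4 the text is "Section ...", which is not "see ", so the
# assertion passes and the unwanted line is matched anyway.
lookahead = re.compile(r"(?!see )Section\s+7")
found = lookahead.search(text)

print(found.group())  # Section 7
print(found.start())  # 4
```

The look-ahead never inspects the characters before the match, which is why the word see is left out of the result but the rest is still reported.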

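Use negative lookbehinds:

(?<!see )(?<!under )

in lieu of (?!see ). Python's re module requires every lookbehind to have a fixed width, so each word needs its own assertion; a single (?<!see|under) fails with "look-behind requires fixed-width pattern".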

Other comments: you have a case error (terminal should be Terminal), and if you make your entire string "raw" by prepending it with an r, like r'my string', you don't need to double-escape characters like \\n.
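As a rough check in Python: the built-in re module refuses a lookbehind whose alternatives differ in length, while one lookbehind per word compiles and rejects the prefixed lines. This is only a sketch; the shortened pattern Section\s+7 stands in for the full expression, and the variable names are illustrative.

```python
import re

# Python's re only allows fixed-width lookbehinds, so alternatives of
# different lengths ("see" vs "under") in one lookbehind fail to compile.
try:
    re.compile(r"(?<!see|under)Section")
    single_lookbehind_compiles = True
except re.error:
    single_lookbehind_compiles = False

# One lookbehind per word, each including the trailing space.
pattern = re.compile(r"(?<!see )(?<!under )Section\s+7")

samples = [
    "Section 7. Delivering a Terminal Diagnosis",
    "see Section 7. Delivering a Terminal Diagnosis",
    "filed under Section 7. Delivering a Terminal Diagnosis",
]
results = [bool(pattern.search(s)) for s in samples]

print(single_lookbehind_compiles)  # False
print(results)                     # [True, False, False]
```

Because the lookbehinds sit directly in front of Section, the whole match is discarded when either word precedes it, rather than just the word being left out.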

Try the following.

Whatever case you are matching, I would put r in front of your regular expression. r is Python's raw string notation, which avoids having to double-escape backslashes in regular expression patterns. To avoid worrying about uppercase versus lowercase, use re.I for case-insensitive matching.

Here's a possible solution using two negative lookbehinds.

(?<!see)(?<!under)\s+(section 7[\s.:-]+(?:deliver(?:ing)?).*?terminal\s+diagnosis(?:'s)?)



As an example of using the raw string notation and re.I, this is what I meant.

import re
# s holds the journal text to search
matches = re.findall(r"(?<!see)(?<!under)\s+(section 7[\s.:-]+(?:deliver(?:ing)?).*?terminal\s+diagnosis(?:'s)?)", s, re.I)
print(matches)
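For a self-contained run, the same pattern can be tried against a small sample. The excerpt below is hypothetical, assembled from the lines in the question, and the names PATTERN and s are only for illustration.

```python
import re

PATTERN = r"(?<!see)(?<!under)\s+(section 7[\s.:-]+(?:deliver(?:ing)?).*?terminal\s+diagnosis(?:'s)?)"

# Hypothetical journal excerpt built from the examples in the question.
s = (
    "Some introductory text.\n"
    "Section 7 - Delivering Terminal Diagnosis's\n"
    "see Section 7. Delivering a Terminal Diagnosis\n"
    "filed under Section 7. Delivering a Terminal Diagnosis\n"
    "Section 7. Delivering a Terminal Diagnosis\n"
)

# findall returns only the captured group, without the leading whitespace.
matches = re.findall(PATTERN, s, re.I)
for m in matches:
    print(m)
# Section 7 - Delivering Terminal Diagnosis's
# Section 7. Delivering a Terminal Diagnosis
```

Two caveats with this form: the leading \s+ means a heading at the very start of the text, with no whitespace before it, is not found, and a line with two spaces between see and Section would slip past the lookbehinds. Whether either matters depends on how the journals are laid out.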
