
do not capture before string

I have a series of tokens to capture:

sweet (capture)
SWEETENED (capture)
not sweet (do not capture)
bitterly sweet (do not capture)

Right now I have this regular expression, but it does not fulfil my requirements:

 ^(?!not)^(?!bitterly)(sweet|SWEET|Sweet)(ed|ED)?

This expression does not capture any of the terms. What lookahead should I use to capture this?

PS I'm using Python for this

Approach 1: match and capture what you need and just match the rest

You can take advantage of the fact that re.findall returns only the capturing group values when the pattern defines capturing groups. Match what you want to ignore, and match and capture what you want to keep. re.findall also returns an empty string whenever the capturing group does not take part in a match, which is why filter(None, results) comes in handy.

Here is a Python snippet:

import re
s = '''sweet (capture)
SWEETENED (capture)
not sweet (do not capture)
bitterly sweet (do not capture)'''
# In Python 3, filter() returns a lazy iterator, so wrap it in list() before printing
print(list(filter(None, re.findall(r'\b(?:bitterly|not)\s+sweet|\b(sweet\w*)\b', s, flags=re.I))))
# => ['sweet', 'SWEETENED']

Here,

  • \b(?:bitterly|not)\s+sweet - matches the whole word bitterly or not, followed by one or more whitespace characters and the substring sweet
  • | - or
  • \b(sweet\w*)\b - captures the whole word sweet with any other word characters after it (you may use your own pattern instead of \w*)
  • flags=re.I - makes the pattern case-insensitive.

See the regex demo (only the text highlighted in green is kept by the code).
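The same idea can be sketched with re.finditer instead of filter. This is only an illustration: find_sweet is a hypothetical helper name, not part of the original answer, and the pattern is the one shown above.

```python
import re

# Alternative 1 consumes "not sweet" / "bitterly sweet" without capturing.
# Alternative 2 captures a standalone "sweet..." word in group 1.
PATTERN = re.compile(r'\b(?:bitterly|not)\s+sweet|\b(sweet\w*)\b', re.I)

def find_sweet(text):
    """Return captured 'sweet...' words, skipping matches where group 1 is empty."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]

s = '''sweet (capture)
SWEETENED (capture)
not sweet (do not capture)
bitterly sweet (do not capture)'''

print(find_sweet(s))
# => ['sweet', 'SWEETENED']
```

Because the first alternative uses \s+, any amount of whitespace between the excluded word and sweet is handled.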

Approach 2: Lookbehinds that do not allow much control over the input

A couple of words about a negative lookbehind approach: I do not think it is good in this case since lookbehinds in Python re are fixed-width, and all alternatives in the lookbehind must be of the same width.

You might use

(?i)(?<!\bbitterly )(?<!\bnot )\bsweet\w*

(see demo), but it would fail if there are 2 or 3 spaces between bitterly and sweet.
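A small sketch of that limitation, using the fixed-width pattern above on made-up one-line inputs (the sample strings are assumptions for illustration):

```python
import re

rx = re.compile(r'(?i)(?<!\bbitterly )(?<!\bnot )\bsweet\w*')

one_space = rx.findall('not sweet')    # lookbehind sees "not " and blocks the match
two_spaces = rx.findall('not  sweet')  # lookbehind sees "ot  ", so "sweet" slips through

print(one_space)   # => []
print(two_spaces)  # => ['sweet']

# Python's built-in re refuses a variable-width lookbehind outright.
try:
    re.compile(r'(?<!\b(?:bitterly|not)\s+)\bsweet\w*')
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print('re rejected the pattern:', exc)
```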

Approach 3: Variable-width lookbehind with the PyPI regex module

Another interesting solution is to use the PyPI regex module, which supports variable-width lookbehinds:

import regex
s='''sweet (capture)
SWEETENED (capture)
not sweet (do not capture)
bitterly sweet (do not capture)'''
rx = r'(?<!\b(?:bitterly|not)\s+)\bsweet\w*\b'
print(regex.findall(rx, s, flags=regex.I))
# => ['sweet', 'SWEETENED']

See the Python demo on REXTESTER.

The whole word sweet (with any word chars at the end) is matched only if there is no \b(?:bitterly|not)\s+ pattern before it.

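In addition to @Wiktor's answer, which mimics (*SKIP)(*FAIL), you might get along with a negative lookahead. The inline (?i) flag has to come first, because Python 3.11+ rejects global flags that are not at the start of the expression:

(?i)(?!.*\b(?:not|bitterly))sweet(?:ened)?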

See a demo on regex101.com.


Disadvantage (maybe?): the position of not and bitterly is not taken into account, thus a sentence like

sweet and not sour

won't be matched. It's up to you to decide whether this is desired or not.
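A quick sketch of how this lookahead behaves, with the (?i) flag placed at the start of the pattern as Python 3.11+ requires. The sample lines are partly taken from the question and partly made up for illustration.

```python
import re

rx = re.compile(r'(?i)(?!.*\b(?:not|bitterly))sweet(?:ened)?')

samples = [
    'sweet (capture)',             # nothing excluded after "sweet"
    'not sweet (do not capture)',  # "not" appears later in the line
    'sweet and not sour',          # "not" appears later in the line
    'not sweet',                   # "not" appears only before "sweet"
]
results = {line: rx.findall(line) for line in samples}

for line, found in results.items():
    print(repr(line), '->', found)
# 'sweet (capture)' -> ['sweet']
# 'not sweet (do not capture)' -> []
# 'sweet and not sour' -> []
# 'not sweet' -> ['sweet']
```

A lookahead only inspects the text after the current position. In this sketch, the second line is rejected because of the later "do not capture", and a bare "not sweet" with nothing after it still matches.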


 