do not capture before string

Question

I have a series of tokens to capture:

sweet (capture)
SWEETENED (capture)
not sweet (do not capture)
bitterly sweet (do not capture)

Right now, I wrote this regex, but it does not fulfil my requirements:

 ^(?!not)^(?!bitterly)(sweet|SWEET|Sweet)(ed|ED)?

This expression does not capture any of the terms. What lookahead should I use to capture this?

PS I'm using Python for this

Answer 1

Approach 1: match and capture what you need and just match the rest

You may leverage re.findall, which returns only the capturing group values when the pattern defines capturing groups. You just need to match what you want to ignore, and match and capture what you need to obtain. However, it also returns empty strings whenever the capturing group does not participate in a match, which is why filter(None, results) comes in handy.

Here is a Python snippet:

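import re

s = '''sweet (capture)
SWEETENED (capture)
not sweet (do not capture)
bitterly sweet (do not capture)'''
# The first alternative only matches (and discards) the unwanted phrases;
# the second alternative captures the words to keep.
# list() is needed on Python 3, where filter() returns a lazy iterator.
print(list(filter(None, re.findall(r'\b(?:bitterly|not)\s+sweet|\b(sweet\w*)\b', s, flags=re.I))))
# => ['sweet', 'SWEETENED']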

Here,

\b(?:bitterly|not)\s+sweet - matches the whole word bitterly or not, followed by one or more whitespace characters and the substring sweet
| - or
\b(sweet\w*)\b - a whole word starting with sweet, followed by any other word chars (you may use your own pattern instead of \w*)
flags=re.I - makes the pattern case insensitive.

See the regex demo (only the green-highlighted text, i.e. the captured group, is kept by the code).
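To make the role of the empty strings concrete, here is a small sketch that prints what re.findall returns before any filtering. The variable names (pattern, raw, kept) are mine and only for illustration; the pattern and sample text are the ones from the snippet above.

```python
import re

s = '''sweet (capture)
SWEETENED (capture)
not sweet (do not capture)
bitterly sweet (do not capture)'''

pattern = re.compile(r'\b(?:bitterly|not)\s+sweet|\b(sweet\w*)\b', flags=re.I)

# findall returns group 1 for every match, including matches of the first
# alternative, where group 1 did not participate and comes back as ''.
raw = pattern.findall(s)
print(raw)

# Dropping the empty strings leaves only the captured words.
kept = [word for word in raw if word]
print(kept)
```

With the sample text, raw should contain two empty strings (one for "not sweet", one for "bitterly sweet"), and the list comprehension is just an alternative spelling of filter(None, ...).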

Approach 2: Lookbehinds that do not allow much control over the input

A couple of words about a negative lookbehind approach: I do not think it is good in this case since lookbehinds in Python re are fixed-width, and all alternatives in the lookbehind must be of the same width.

You might use

(?i)(?<!\bbitterly )(?<!\bnot )\bsweet\w*

(see demo), but it would fail if there are 2 or 3 spaces between bitterly and sweet.
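A quick sketch of that limitation, using the lookbehind pattern above on two made-up inputs (the test strings and variable names are my own assumptions, not part of the original demo). It also shows that the standard re module refuses a variable-width lookbehind outright:

```python
import re

# Fixed-width lookbehinds: each one checks an exact number of characters.
fixed = re.compile(r'(?i)(?<!\bbitterly )(?<!\bnot )\bsweet\w*')

one_space = fixed.findall('not sweet')    # lookbehind sees "not " -> rejected
two_spaces = fixed.findall('not  sweet')  # lookbehind sees "ot  " -> accepted
print(one_space, two_spaces)

# A variable-width lookbehind is rejected by the standard re module.
try:
    re.compile(r'(?<!\b(?:bitterly|not)\s+)\bsweet\w*')
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print('re.error:', exc)
```

So the single-space case is filtered out as intended, while a second space is enough to let "sweet" through.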

Approach 3: Variable width lookbehind with PyPI regex module

Another interesting solution is using the PyPI regex module, where you can use variable-width lookbehinds:

import regex
s='''sweet (capture)
SWEETENED (capture)
not sweet (do not capture)
bitterly sweet (do not capture)'''
rx = r'(?<!\b(?:bitterly|not)\s+)\bsweet\w*\b'
print(regex.findall(rx, s, flags=regex.I))
# => ['sweet', 'SWEETENED']

See the Python demo on REXTESTER.

The whole word sweet (with any word chars at the end) is matched only if there is no \b(?:bitterly|not)\s+ pattern before it.

Answer 2

In addition to @Wiktor's answer, which mimics (*SKIP)(*FAIL), you might get by with a negative lookahead:

(?i)(?!.*\b(?:not|bitterly))sweet(?:ened)?

See a demo on regex101.com.

Disadvantage (maybe?): the lookahead only checks whether not or bitterly occurs somewhere later on the line; where exactly is not taken into account, thus a sentence like

sweet and not sour

won't be matched. It's up to you to decide whether this is desired or not.
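For anyone who wants to try this in Python, here is a minimal sketch. Two things are my own assumptions: the inline (?i) flag is placed at the very start of the pattern, because Python 3.11 and later reject global inline flags anywhere else, and the last two test lines are made up to probe the behaviour described above.

```python
import re

# Answer 2's pattern, with (?i) at the front so that it compiles on
# Python 3.11+ (which rejects global inline flags elsewhere in the pattern).
pattern = re.compile(r'(?i)(?!.*\b(?:not|bitterly))sweet(?:ened)?')

lines = [
    'sweet (capture)',
    'SWEETENED (capture)',
    'sweet and not sour',  # "not" appears later on the line
    'not sweet',           # "not" appears only before the word
]

results = {}
for line in lines:
    m = pattern.search(line)
    results[line] = m.group(0) if m else None
    print(repr(line), '->', results[line])
```

Note that the bare 'not sweet' line still yields a match here, since a lookahead only inspects what comes after the current position, so it is worth checking this pattern against your real input before relying on it.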
