
RegEx for matching strings with spaces and words

I have the following string:

the quick brown fox abc(1)(x)

with the following regex:

(?i)(\s{1})(abc\(1\)\([x|y]\))

and the output is

abc(1)(x)

which is expected. However, I can't seem to:

  1. use \W, \w, \d, \D, etc. to extract more than one space
  2. combine the quantifier to add more spaces.

I would like the following output:

the quick brown fox abc(1)(x)

Starting from the primary lookup "abc(1)(x)", I would like up to 5 words on either side of it. My assumption is that spaces demarcate a word.

Edit 1:

The 5 words on either side are not known in advance and will differ in future examples. For instance, the string may be:

cat with a black hat is abc(1)(x) the quick brown fox jumps over the lazy dog.

In this case, the desired output would be:

with a black hat is abc(1)(x) the quick brown fox jumps

Edit 2:

Edited the expected output in the first example and added "up to" 5 words.

(?:[0-9A-Za-z_]+[^0-9A-Za-z_]+){0,5}abc\(1\)\([xy]\)(?:[^0-9A-Za-z_]+[0-9A-Za-z_]+){0,5}

Note that I've changed \w+ to [0-9A-Za-z_]+ and \W+ to [^0-9A-Za-z_]+ because, depending on your locale / Unicode settings, \W and \w might not act the way you expect in Python.
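To illustrate the point about \w, here is a minimal sketch assuming Python 3 and str (not bytes) patterns, where \w is Unicode-aware by default. The sample text is made up for the demonstration.

```python
import re

text = "naïve café"

# Default for str patterns in Python 3: \w also matches non-ASCII letters.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

# The explicit character class behaves the same regardless of flags.
explicit_words = re.findall(r"[0-9A-Za-z_]+", text)

print(unicode_words)   # ['naïve', 'café']
print(ascii_words)     # ['na', 've', 'caf']
print(explicit_words)  # ['na', 've', 'caf']
```

Whether the Unicode-aware or the ASCII-only behaviour is the one you want depends on your data.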

Also note that I don't specifically look for spaces, just "non-word characters". This probably handles edge cases such as quote characters a little better. Regardless, this should get you most of the way there.
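As a quick check, the pattern above can be run against both strings from the question. This is only a sketch: the helper name `context` and the constant `PATTERN` are made up for illustration.

```python
import re

# The pattern from this answer, split across lines for readability.
PATTERN = re.compile(
    r"(?:[0-9A-Za-z_]+[^0-9A-Za-z_]+){0,5}"   # up to 5 words before
    r"abc\(1\)\([xy]\)"                       # the lookup itself
    r"(?:[^0-9A-Za-z_]+[0-9A-Za-z_]+){0,5}"   # up to 5 words after
)

def context(text):
    """Return the lookup with up to 5 words on either side, or None."""
    match = PATTERN.search(text)
    return match.group(0) if match else None

first = context("the quick brown fox abc(1)(x)")
second = context(
    "cat with a black hat is abc(1)(x) "
    "the quick brown fox jumps over the lazy dog."
)

print(first)   # the quick brown fox abc(1)(x)
print(second)  # with a black hat is abc(1)(x) the quick brown fox jumps
```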

BTW: You're calling this "lookaround", but it really has nothing to do with the regex feature of that name.

If I understand your requirements correctly, you want to do something like this:

(?:\w+[ ]){0,5}(abc\(1\)\([xy]\))(?:[ ]\w+){0,5}


Breakdown:

(?:               # Start of a non-capturing group.
    \w+           # Any word character repeated one or more times (basically, a word).
    [ ]           # Matches a space character literally.
)                 # End of the non-capturing group.
{0,5}             # Match the previous group between 0 and 5 times.
(                 # Start of the first capturing group.
    abc\(1\)      # Matches "abc(1)" literally.
    \([xy]\)      # Matches "(x)" or "(y)". You don't need "|" inside a character class.
)                 # End of the capturing group.
(?:[ ]\w+){0,5}   # Same as the non-capturing group above but the space is before the word.
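The breakdown above maps fairly directly onto Python's re.VERBOSE mode, which allows whitespace and comments inside the pattern. The following is a sketch using the second string from the question; the variable names are illustrative only.

```python
import re

PATTERN = re.compile(
    r"""
    (?:\w+[ ]){0,5}       # up to 5 words, each followed by one space
    (abc\(1\)\([xy]\))    # group 1: "abc(1)(x)" or "abc(1)(y)"
    (?:[ ]\w+){0,5}       # up to 5 words, each preceded by one space
    """,
    re.VERBOSE,
)

text = (
    "cat with a black hat is abc(1)(x) "
    "the quick brown fox jumps over the lazy dog."
)
match = PATTERN.search(text)

whole = match.group(0)   # the lookup plus surrounding words
lookup = match.group(1)  # just the lookup

print(whole)   # with a black hat is abc(1)(x) the quick brown fox jumps
print(lookup)  # abc(1)(x)
```

Writing the space as [ ] rather than a bare space matters here, because re.VERBOSE ignores unescaped whitespace outside character classes.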

Notes:

  • To make the pattern case insensitive, you may start it with (?i) as you're doing already or use the re.IGNORECASE flag.
  • If you want to support words not separated by a space, you may replace [ ] with either \W+ (which means one or more non-word characters) or with a character class which includes all the punctuation characters that you want to support (e.g., [.,;?! ]).
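Both notes can be tried side by side. The sketch below uses a made-up sample string containing punctuation and an upper-case lookup; the names `SPACE_ONLY` and `NON_WORD` are hypothetical.

```python
import re

# Words separated by exactly one space.
SPACE_ONLY = re.compile(
    r"(?:\w+[ ]){0,5}(abc\(1\)\([xy]\))(?:[ ]\w+){0,5}",
    re.IGNORECASE,
)

# Words separated by any run of non-word characters.
NON_WORD = re.compile(
    r"(?:\w+\W+){0,5}(abc\(1\)\([xy]\))(?:\W+\w+){0,5}",
    re.IGNORECASE,
)

text = "one, two; three ABC(1)(Y), four. five"

space_match = SPACE_ONLY.search(text)
non_word_match = NON_WORD.search(text)

print(space_match.group(0))     # three ABC(1)(Y)
print(non_word_match.group(0))  # one, two; three ABC(1)(Y), four. five
print(non_word_match.group(1))  # ABC(1)(Y)
```

The space-only version stops at the first word that is followed by punctuation, while the \W+ version keeps collecting words across commas, semicolons and periods.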
