
Regex match all words except those between quotes

In this example I want to select all words except those between quotes (i.e. "results", "items", "packages", "settings" and "build_type", but not "compiler.version").

results[0].items[0].packages[0].settings["compiler.version"] 
results[0].items[0].packages[0].settings.build_type

Here's what I know: I can target all words with

[a-z_]+

and then target what's in between quotes with this:

(?<=\")[\w.]+(?=\")

Is there any way to match the difference between the results of the first and second regex (i.e. words, except those surrounded by double quotes)?

Here's a regex playground with the example for convenience.

I believe this is the cleaner/simpler version of the solution you were searching for:

(?<!\")\b[a-z_]+\b(?!\")

Here's a demo
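
As a quick check, the pattern can be run against both sample lines in Python. This is a minimal sketch rather than part of the original answer, and the variable names are illustrative. The third input is a made-up case showing a limitation that is easy to miss: a word in the middle of a quoted string with three or more dot-separated parts has no quote on either side, so it still matches.

```python
import re

# Lookaround pattern from the answer above (the quotes need no escaping in Python).
pattern = re.compile(r'(?<!")\b[a-z_]+\b(?!")')

first = pattern.findall('results[0].items[0].packages[0].settings["compiler.version"]')
second = pattern.findall('results[0].items[0].packages[0].settings.build_type')
# Hypothetical input, not from the question: the middle part "b" is not adjacent to a quote.
third = pattern.findall('x["a.b.c"]')

print(first)   # ['results', 'items', 'packages', 'settings']
print(second)  # ['results', 'items', 'packages', 'settings', 'build_type']
print(third)   # ['x', 'b']
```

So the pattern appears to fit the two lines in the question, but it only looks at the characters directly next to each word and does not track whether the word is actually inside a quoted string.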

Please let me know if this was helpful/if this was what you wanted!

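You can match strings between double quotes (so that they are consumed and skipped) and, as the alternative, match and capture words optionally followed by dot-separated words. Filtering out the empty captures leaves only the unquoted words: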

list(filter(None, re.findall(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', text, re.ASCII | re.I)))

See the regex demo. Details:

  • "[^"]*" - a " char, zero or more chars other than " and then a " char
  • | - or
  • ([a-z_]\w*(?:\.[a-z_]\w*)*) - Group 1: a letter or underscore followed by zero or more word chars, and then zero or more sequences of a . followed by a letter or underscore and zero or more word chars.

See the Python demo:

import re
text = 'results[0].items[0].packages[0].settings["compiler.version"] '
print(list(filter(None, re.findall(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', text, re.ASCII | re.I))))
# => ['results', 'items', 'packages', 'settings']

The re.ASCII option is used to make \w match [a-zA-Z0-9_] without accounting for Unicode chars.
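
One thing worth checking is how this pattern treats the second line from the question. Because Group 1 also consumes dot-separated words, "settings.build_type" comes back as a single token, whereas the question lists "build_type" separately. The sketch below compares the answer's pattern with a simplified variant that drops the dotted part; the variant and the variable names are assumptions for illustration, not part of the original answer.

```python
import re

line = 'results[0].items[0].packages[0].settings.build_type'
flags = re.ASCII | re.I

# Pattern from the answer: dot-separated words are captured as one token.
dotted = r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)'
dotted_words = list(filter(None, re.findall(dotted, line, flags)))
print(dotted_words)  # ['results', 'items', 'packages', 'settings.build_type']

# Hypothetical variant: capture each word on its own, still skipping quoted strings.
single = r'"[^"]*"|([a-z_]\w*)'
single_words = list(filter(None, re.findall(single, line, flags)))
print(single_words)  # ['results', 'items', 'packages', 'settings', 'build_type']
```

Which behavior is preferable depends on whether a dotted path should count as one word or several.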

A word is not within a double-quoted substring if and only if it is followed, on the same line, by an even number of double-quotes (assuming the line is properly formatted and therefore contains an even number of double-quotes). You can use the following regular expression to match words that are not contained within double-quoted substrings.

[a-z_]+(?=(?:(?:[^\"\n]*\"){2})*[^\"\n]*\n)

Demo

The regular expression can be broken down as follows (alternatively, hover the cursor over each part of the expression at the link to obtain an explanation of its function).

[a-z_]+         # match one or more of the indicated characters
(?=             # begin a positive lookahead
  (?:           # begin an outer non-capture group
    (?:         # begin an inner non-capture group
      [^\"\n]*  # match zero or more characters other than " and \n 
      \"        # match "
    ){2}        # end inner non-capture group and execute twice
  )*            # end outer non-capture group and execute zero or more times
  [^\"\n]*      # match zero or more characters other than " and \n 
  \n            # match a newline
)               # end positive lookahead

\n should be replaced by (?:\n|$) if the last line may not have a line terminator.
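
The even-number-of-quotes idea can be tried in Python as follows. This is a sketch under the assumption that the input is the two lines from the question joined by a newline, with no terminator on the last line, so it uses the (?:\n|$) form; the variable names are illustrative.

```python
import re

# Lookahead: the rest of the line must contain an even number of double quotes.
pattern = re.compile(r'[a-z_]+(?=(?:(?:[^"\n]*"){2})*[^"\n]*(?:\n|$))')

text = (
    'results[0].items[0].packages[0].settings["compiler.version"]\n'
    'results[0].items[0].packages[0].settings.build_type'
)

words = pattern.findall(text)
print(words)
# ['results', 'items', 'packages', 'settings',
#  'results', 'items', 'packages', 'settings', 'build_type']
```

"compiler" and "version" are each followed by exactly one double quote before the end of their line, so the lookahead fails for them.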
