
Regex to match the first occurrence of a string between quotes, but exclude certain words?

I have a regular expression that catches all characters within quotes in a text file, but I'd like to:

  • match only the first occurrence of this pattern
  • exclude certain words from this pattern match

Here's what I have so far:

((?:\\.|[^"\\])*)

That matches all text within quotes, like so:

" this is some text with the word printed in it? "

However, I'd like the pattern to match only the first occurrence, so I think I would need a {1} at some point.

Then, I want to exclude certain words, and I have this:

^(?!.*word1|word2|word3)

But I'm not familiar enough with regex to put it all together.

I think you can use this regex to match the first occurrence of a string in double quotation marks that does not contain a word from a list:

^.*?(?!"[^"]*?\b(?:word1|word2|word3)\b[^"]*?")"([^"]+?)"(?=(?:(?:[^"]*"[^"]*){2})*[^"]*$)

Sample code (Python 3):

import re
# Group 1 captures the first quoted string that contains none of the excluded words.
p = re.compile(r'^.*?(?!"[^"]*?\b(?:word1|word2|word3)\b[^"]*?")"([^"]+?)"(?=(?:(?:[^"]*"[^"]*){2})*[^"]*$)')
test_str = '"word that is not matched word1" "word2 word1 word3" "this is some text word4 with the word printed in it?"'
print(p.search(test_str).group(1))

Output:

this is some text word4 with the word printed in it? 

As for maintainability, the excluded words can be pulled from any source, and the regex can be built dynamically.
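To illustrate building it dynamically, here is a minimal sketch that assembles the regex above from a word list. The helper name `build_pattern` is my own, and I'm assuming the list is non-empty and comes from whatever source you like (a file, a config value, and so on).

```python
import re

def build_pattern(excluded_words):
    """Compile the regex above for a non-empty list of excluded words.

    Hypothetical helper, not part of any library.
    """
    # re.escape guards against words that contain regex metacharacters.
    alternation = "|".join(re.escape(w) for w in excluded_words)
    return re.compile(
        r'^.*?(?!"[^"]*?\b(?:' + alternation + r')\b[^"]*?")'  # skip quotes holding an excluded word
        r'"([^"]+?)"'                                           # capture the quoted text
        r'(?=(?:(?:[^"]*"[^"]*){2})*[^"]*$)'                    # an even number of quotes must follow
    )

pattern = build_pattern(["word1", "word2", "word3"])
test_str = '"word that is not matched word1" "word2 word1 word3" "this is some text word4 with the word printed in it?"'
match = pattern.search(test_str)
first_allowed = match.group(1) if match else None
print(first_allowed)
```

Checking `match` before calling `group(1)` avoids an `AttributeError` when every quoted string contains an excluded word.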

Does it have to be a single regex to tackle all these requirements at once? Your code would probably stay much more maintainable if you'd just use a simple regex to find quoted strings, then filter all matches against the excluded words blacklist and finally choose the first that remains.

import re

excluded = ('excluded', 'forbidden')
text = 'So, "this string contains an excluded word". "This second string is thus the one we want to find!" another qu"oted st"ring ... and another "quoted string with a forbidden word"'

# Find every double-quoted string (the quotes are part of each result).
quoted_strings = re.findall('".*?"', text)
# Keep only the strings that contain none of the excluded words.
allowed_quoted_strings = [q for q in quoted_strings if not any(e in q for e in excluded)]
# The first one that remains is the string we want.
wanted_string = allowed_quoted_strings[0]

Or, if you prefer it as one giant expression:

import re
wanted_string = [q for q in re.findall('".*?"', 'So, "this string contains an excluded word". "This second string is thus the one we want to find!" another qu"oted st"ring ... and another "quoted string with a forbidden word"') if not any(e in q for e in ('excluded', 'forbidden'))][0]
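If you want the same find-then-filter idea to stop at the first hit, match whole words only, and return `None` when nothing qualifies, something like the following sketch might do. The function name `first_allowed_quote` is made up for illustration, the quote pattern is the one from the question wrapped in literal quotes, and I'm assuming the excluded list is non-empty.

```python
import re

def first_allowed_quote(text, excluded):
    """Return the first double-quoted string with no excluded word, or None.

    Hypothetical helper; assumes `excluded` is a non-empty sequence of words.
    """
    # Whole-word test, so "forbidden" does not also reject "unforbidden".
    banned = re.compile(r"\b(?:" + "|".join(map(re.escape, excluded)) + r")\b")
    # Lazily scan quoted strings; the group tolerates backslash-escaped characters.
    quoted = (m.group(1) for m in re.finditer(r'"((?:\\.|[^"\\])*)"', text))
    # next() stops at the first string that passes the filter.
    return next((q for q in quoted if not banned.search(q)), None)

text = 'So, "this string contains an excluded word". "This second string is thus the one we want to find!" another qu"oted st"ring ... and another "quoted string with a forbidden word"'
result = first_allowed_quote(text, ("excluded", "forbidden"))
print(result)
```

Unlike the `findall` version, this returns the text without the surrounding quotes and does not raise an `IndexError` when every quoted string is rejected.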
