Regex to match all words except a given list

I am trying to write a replacement regular expression to surround all words in quotes except the words AND, OR and NOT.

I have tried the following for the match part of the expression:

(?i)(?<word>[a-z0-9]+)(?<!and|not|or)

and

(?i)(?<word>[a-z0-9]+)(?!and|not|or)

but neither works. The replacement expression is simple and currently surrounds all words.

"${word}"

So

This and This not That

becomes

"This" and "This" not "That"

This is a little dirty, but it works:

(?<!\b(?:and| or|not))\b(?!(?:and|or|not)\b)

In plain English, this matches any word boundary that is neither preceded by nor followed by "and", "or", or "not". It matches whole words only, e.g. the position after the word "sand" would not be excluded just because it is preceded by "and".

The space in front of the "or" in the zero-width look-behind assertion is necessary to make it a fixed-length look-behind. See whether that already solves your problem.

EDIT: Applied to the string "except the words AND, OR and NOT." as a global replace with single quotes, this returns:

'except' 'the' 'words' AND, OR and NOT.

John,

The regex in your question is almost correct. The only problem is that you put the lookahead at the end of the regex instead of at the start. Also, you need to add word boundaries to force the regex to match whole words. Otherwise, it will match "nd" in "and", "r" in "or", etc., because "nd" and "r" are not in your negative lookahead.

(?i)\b(?!and|not|or)(?<word>[a-z0-9]+)\b
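As a quick check, here is a minimal sketch of that pattern used with the "${word}" replacement from the question. The variable names and the second test string are made up for illustration. It also shows one caveat: because the lookahead has no closing \b, words that merely start with "and", "not" or "or" are skipped too. The `strict` pattern is a hypothetical variant, not part of the answer above, that adds that boundary.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer above: the lookahead sits right after the leading \b.
string original = @"(?i)\b(?!and|not|or)(?<word>[a-z0-9]+)\b";
// Hypothetical stricter variant: the excluded word must also end at a word boundary.
string strict = @"(?i)\b(?!(?:and|not|or)\b)(?<word>[a-z0-9]+)\b";

// Replacement from the question: wrap the named group in double quotes.
string replacement = "\"${word}\"";

string a = Regex.Replace("This and This not That", original, replacement);
string b = Regex.Replace("order and notes", original, replacement);
string c = Regex.Replace("order and notes", strict, replacement);

Console.WriteLine(a); // "This" and "This" not "That"
Console.WriteLine(b); // order and notes
Console.WriteLine(c); // "order" and "notes"
```

If words such as "order" or "notes" can occur in your input, the stricter form is probably what you want.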

Call me crazy, but I'm not a fan of fighting regex; I limit my patterns to simple things I can understand, and often cheat for the rest - for example via a MatchEvaluator:

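    using System;
    using System.Text.RegularExpressions;

    // Words that must stay unquoted (compared case-sensitively).
    string[] whitelist = new string[] { "and", "not", "or" };
    string input = "foo and bar or blop";
    // Match every lowercase alphanumeric run, then decide per match whether to quote it.
    string result = Regex.Replace(input, @"([a-z0-9]+)",
        delegate(Match match) {
            string word = match.Groups[1].Value;
            return Array.IndexOf(whitelist, word) >= 0
                ? word : ("\"" + word + "\"");
        });
    // result: "foo" and "bar" or "blop"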

(edited for more terse layout)

Based on Tomalak's answer:
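The snippet above only handles lowercase input. A possible case-insensitive variant of the same idea is sketched below. The `keep` set and the `Quote` helper are names invented for this example, and the character class is widened to include uppercase letters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Case-insensitive set of words that must stay unquoted (hypothetical name).
var keep = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "and", "not", "or" };

// Quote every alphanumeric run unless it is in the keep set.
string Quote(string input) =>
    Regex.Replace(input, @"[A-Za-z0-9]+",
        m => keep.Contains(m.Value) ? m.Value : "\"" + m.Value + "\"");

Console.WriteLine(Quote("This and This NOT That")); // "This" and "This" NOT "That"
```

The regex stays trivial and the exclusion logic lives in ordinary code, which is the point of this approach.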

(?<!and|or|not)\b(?!and|or|not)

This regex has two problems:

  1. (?<! ) only works for fixed-length look-behind

  2. The previous regex only looked at the ending/beginning of the surrounding words, not the whole word.

(?<!\band)(?<!\bor)(?<!\bnot)\b(?!(?:and|or|not)\b)

This regex fixes both of the above problems: first, by splitting the look-behind into three separate ones; second, by adding word boundaries (\b) inside the look-arounds.
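A small sketch of how this boundary-only pattern might be used in .NET: every matched position is replaced with a single quote. Passing RegexOptions.IgnoreCase is an assumption made here so that the uppercase AND/OR/NOT are excluded as well. The second input is invented to show the whole-word behaviour.

```csharp
using System;
using System.Text.RegularExpressions;

// Zero-width match at each boundary of every word except and/or/not.
string boundary = @"(?<!\band)(?<!\bor)(?<!\bnot)\b(?!(?:and|or|not)\b)";

string quoted = Regex.Replace(
    "except the words AND, OR and NOT.", boundary, "'", RegexOptions.IgnoreCase);
Console.WriteLine(quoted); // 'except' 'the' 'words' AND, OR and NOT.

// "sand" merely ends in "and", so it is still quoted.
string sand = Regex.Replace("sand or gold", boundary, "'", RegexOptions.IgnoreCase);
Console.WriteLine(sand); // 'sand' or 'gold'
```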

To match any "word" that is a combination of letters, digits or underscores (including any other word chars defined in the \w shorthand character class), you may use word boundaries, as in

\b(?!(?:word1|word2|word3)\b)\w+

If the "word" is a chunk of non-whitespace characters with start/end of string or whitespace on both ends, use whitespace boundaries, as in

(?<!\S)(?!(?:word1|word2|word3)(?!\S))\S+

Here, the two expressions will look like

\b(?!(?:and|not|or)\b)\w+
(?<!\S)(?!(?:and|not|or)(?!\S))\S+

See the regex demo (or the popular regex101 demo, but please note that the PCRE \w meaning is different from the .NET \w meaning).

Pattern explanation

  • \b - word boundary
  • (?<!\S) - a negative lookbehind that matches a location that is not immediately preceded by a character other than whitespace; it requires a start-of-string position or a whitespace char right before the current location
  • (?!(?:word1|word2|word3)\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a word1, word2 or word3 char sequence followed by a word boundary (or, if the (?!\S) whitespace right-hand boundary is used, followed by whitespace or the end of the string)
  • \w+ - 1+ word chars
  • \S+ - 1+ chars other than whitespace
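To see how the two boundary styles differ in practice, here is a minimal sketch that runs both expressions over the same string. The sample text is made up and deliberately contains punctuation inside "words".

```csharp
using System;
using System.Text.RegularExpressions;

string text = "C# and rock-n-roll not and/or";

// Word boundaries: a "word" is a run of \w characters.
string byWord = Regex.Replace(text, @"\b(?!(?:and|not|or)\b)\w+", "\"$&\"");
// Whitespace boundaries: a "word" is a run of non-whitespace characters.
string byChunk = Regex.Replace(text, @"(?<!\S)(?!(?:and|not|or)(?!\S))\S+", "\"$&\"");

Console.WriteLine(byWord);  // "C"# and "rock"-"n"-"roll" not and/or
Console.WriteLine(byChunk); // "C#" and "rock-n-roll" not "and/or"
```

With word boundaries, "and/or" is treated as two excluded words separated by a slash. With whitespace boundaries it is one chunk that is not on the exception list, so it gets quoted.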

In C#, and any other programming language, you may build the pattern dynamically by joining array/list items with a pipe character; see the demo below:

var exceptions = new[] { "and", "not", "or" };
var result = Regex.Replace("This and This not That", 
        $@"\b(?!(?:{string.Join("|", exceptions)})\b)\w+",
        "\"$&\"");
Console.WriteLine(result); // => "This" and "This" not "That"

If your "words" may contain special characters, the whitespace boundaries approach is more suitable, and make sure to escape the "words" with, say, exceptions.Select(Regex.Escape):

var pattern = $@"(?<!\S)(?!(?:{string.Join("|", exceptions.Select(Regex.Escape))})(?!\S))\S+";
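A short sketch of the escaping variant in use. The exception list containing "c++" and the input string are assumptions chosen to show why escaping matters. The alternation is built in a separate variable here only for readability.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical exception list that contains regex metacharacters.
var exceptions = new[] { "and", "not", "or", "c++" };

// Escape each item so "+" is matched literally, then join the items with "|".
string alternation = string.Join("|", exceptions.Select(w => Regex.Escape(w)));
string pattern = @"(?<!\S)(?!(?:" + alternation + @")(?!\S))\S+";

string result = Regex.Replace("c++ and java not c# or c+++", pattern, "\"$&\"");
Console.WriteLine(result); // c++ and "java" not "c#" or "c+++"
```

Note that "c+++" is quoted: the exception "c++" only applies when it is followed by whitespace or the end of the string.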

NOTE: If there are too many words to search for, it might be a better idea to build a regex trie out of them.

(?!\bnot\b|\band\b|\bor\b|\b\"[^"]+\"\b)((?<=\s|\-|\(|^)[^\"\s\()]+(?=\s|\*|\)|$))

I use this regex to find all words that are not inside double quotes and are not the words "not", "and" or "or".
