C# Regex whitespace between capture groups

Question

My input string is some text containing keywords that I want to match, provided that:

each keyword may have whitespace/non-word chars prepended/appended, or none: (|\s\W)
there must be exactly one non-word/whitespace char separating multiple keywords, or the keyword is at the beginning/end of the line
a keyword simply occurring as a substring does not count, e.g. bar does not match foobarbaz

For example:

input:    "#foo barbazboo tree car"
keywords: {"foo", "bar", "baz", "boo", "tree", "car"}

I am dynamically generating a Regex in C# using an enumerable of keywords and a StringBuilder:

StringBuilder sb = new();
foreach (var kwd in keywords)
{
   sb.Append($"((|[\\s\\W]){kwd}([\\s\\W]|))|");
}
sb.Remove(sb.Length - 1, 1); // last '|'
_regex = new Regex(sb.ToString(), RegexOptions.Compiled | RegexOptions.IgnoreCase);

Testing this pattern on regexr.com, the given input matches all keywords. However, I do not want {bar, baz, boo} included, since there is no whitespace between those keywords. Ideally, I'd want my regex to match only {foo, tree, car}.

Modifying my pattern to (( |[\s\W])kwd([\s\W]| )) causes {bar, baz, boo} not to be included, but gives wrong results for {tree, car}, since in that case there must be at least two spaces between keywords.

How do I specify "there may be only one whitespace separating two keywords", or, to put it differently, "half a whitespace is ok", while preserving the ability to create the regex dynamically?
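
The behavior described in the question can be reproduced with a minimal sketch like the one below. `QuestionPatternDemo`, `Build` and `RawMatches` are hypothetical names introduced only for illustration, and `string.Join` stands in for the StringBuilder loop (it should produce the same pattern).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original code.
static class QuestionPatternDemo
{
    // Builds the same alternation as the StringBuilder loop in the question.
    public static Regex Build(IEnumerable<string> keywords)
    {
        var pattern = string.Join("|",
            keywords.Select(kwd => $"((|[\\s\\W]){kwd}([\\s\\W]|))"));
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }

    // Returns the raw text of every match, including any consumed separator.
    public static List<string> RawMatches(IEnumerable<string> keywords, string input)
    {
        return Build(keywords).Matches(input)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

Calling `QuestionPatternDemo.RawMatches` with the keywords and input from the example also returns bar and baz. Both `(|[\s\W])` and `([\s\W]|)` can match the empty string, so the separator is optional on each side of every keyword.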

Answer 1

In your case, you need to build the pattern like this:

var pattern = $@"\b(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})\b";
_regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);

Here, the longer keywords come before the shorter ones, so, if you have foo, bar and foo bar, the pattern will look like \b(?:foo\ bar|foo|bar)\b and will match foo bar as a whole, rather than foo and bar separately, whenever such a match is possible.

In case your keywords can look like keywords: {"$foo", "^bar^", "[baz]", "(boo)", "tree+", "+car"}, i.e. they can have special chars at the start/end of the keyword, you can use
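
As a rough check of the word-boundary approach, here is a small sketch that wraps the same pattern in a helper. `KeywordMatcher`, `Build` and `Find` are hypothetical names used only for this illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
static class KeywordMatcher
{
    // Longer keywords first, each one escaped, wrapped in \b...\b.
    public static Regex Build(IEnumerable<string> keywords)
    {
        var alternation = string.Join("|",
            keywords.OrderByDescending(k => k.Length).Select(Regex.Escape));
        return new Regex($@"\b(?:{alternation})\b", RegexOptions.IgnoreCase);
    }

    // Returns the matched keywords in the order they occur in the input.
    public static List<string> Find(IEnumerable<string> keywords, string input)
    {
        return Build(keywords).Matches(input)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

With the keywords and input from the question, `KeywordMatcher.Find` returns foo, tree and car. The bar, baz and boo inside barbazboo are skipped because each has a word character directly next to it on at least one side, so one of the two `\b` assertions fails.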

_regex = new Regex($@"(?!\B\w)(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})(?<!\w\B)", RegexOptions.Compiled | RegexOptions.IgnoreCase);

The $@"(?!\B\w)(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})(?<!\w\B)" is an interpolated verbatim string literal that contains

(?!\B\w) - left-hand adaptive dynamic word boundary (a word boundary is required only if the keyword starts with a word char)
(?: - start of a non-capturing group:
- {string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))} - sorts the keywords by length in descending order, escapes them and joins them with |
) - end of the group
(?<!\w\B) - right-hand adaptive dynamic word boundary (a word boundary is required only if the keyword ends with a word char).
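
The adaptive boundaries can be tried out with a sketch along these lines. `AdaptiveKeywordMatcher`, `Build` and `Find` are hypothetical names, and the sample input in the note after the block is made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
static class AdaptiveKeywordMatcher
{
    // (?!\B\w) and (?<!\w\B) require a word boundary only on a side
    // where the keyword itself starts or ends with a word character.
    public static Regex Build(IEnumerable<string> keywords)
    {
        var alternation = string.Join("|",
            keywords.OrderByDescending(k => k.Length).Select(Regex.Escape));
        return new Regex($@"(?!\B\w)(?:{alternation})(?<!\w\B)", RegexOptions.IgnoreCase);
    }

    // Returns the matched keywords in the order they occur in the input.
    public static List<string> Find(IEnumerable<string> keywords, string input)
    {
        return Build(keywords).Matches(input)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

For the special-character keywords above and the input `$foo xtree+ +cars tree+`, `AdaptiveKeywordMatcher.Find` returns $foo and the second tree+. The tree+ in xtree+ is rejected because a word character precedes its t, and the +car in +cars is rejected because a word character follows its r. A plain `\b` on both sides would miss $foo at the start of the input, since there is no word boundary before `$` there.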

Answer 1 (2, accepted 2022-02-20 12:44:09)
