C# Regex whitespace between capturing groups

So basically, my input string is some kind of text containing keywords that I want to match, provided that:

  1. each keyword may have a whitespace/non-word char prepended/appended, or none: (|[\s\W])
  2. there must be exactly one non-word/whitespace char separating multiple keywords, or the keyword is at the beginning/end of the line
  3. a keyword simply occurring as a substring does not count, e.g. bar does not match foobarbaz

E.g.:

input:    "#foo barbazboo tree car"
keywords: {"foo", "bar", "baz", "boo", "tree", "car"}

I am dynamically generating a Regex in C# using an enumerable of keywords and a StringBuilder:

StringBuilder sb = new();
foreach (var kwd in keywords)
{
   sb.Append($"((|[\\s\\W]){kwd}([\\s\\W]|))|");
}
sb.Remove(sb.Length - 1, 1); // last '|'
_regex = new Regex(sb.ToString(), RegexOptions.Compiled | RegexOptions.IgnoreCase);

Testing this pattern on regexr.com, the given input matches all keywords. However, I do not want {bar, baz, boo} included, since there is no whitespace between those keywords. Ideally, I'd want my regex to only match {foo, tree, car}.
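The behavior described above can be reproduced with a small console program. This is only a sketch: the class name OriginalPatternDemo and its helper methods are made up for illustration, the pattern is built exactly as in the snippet above, and the matched values are trimmed only to make the output easier to read.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original question.
public static class OriginalPatternDemo
{
    // Builds the pattern the same way as the StringBuilder loop in the question.
    public static Regex Build(IEnumerable<string> keywords)
    {
        var sb = new StringBuilder();
        foreach (var kwd in keywords)
        {
            sb.Append($"((|[\\s\\W]){kwd}([\\s\\W]|))|");
        }
        sb.Remove(sb.Length - 1, 1); // drop the trailing '|'
        return new Regex(sb.ToString(), RegexOptions.IgnoreCase);
    }

    // Returns every match, trimmed of surrounding whitespace for display.
    public static List<string> Matches(string input, IEnumerable<string> keywords) =>
        Build(keywords).Matches(input).Cast<Match>().Select(m => m.Value.Trim()).ToList();

    public static void Main()
    {
        var keywords = new[] { "foo", "bar", "baz", "boo", "tree", "car" };
        var found = Matches("#foo barbazboo tree car", keywords);

        // The optional separators let bar, baz and boo match back to back.
        Console.WriteLine(string.Join(", ", found));
        // prints: #foo, bar, baz, boo, tree, car
    }
}
```

Because both separator groups may match the empty string, nothing stops one keyword from starting exactly where the previous one ended.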

Modifying my pattern to something like (( |[\s\W])kwd([\s\W]| )) causes {bar, baz, boo} not to be included, but produces bogus results on {tree, car}, since in that case there must be at least two spaces between keywords.

How do I specify "there may be only one whitespace separating two keywords", or, to put it differently, "half a whitespace is OK", while preserving the ability to create the regex dynamically?

In your case, you need to build the pattern dynamically, wrapping the keyword alternation in word boundaries (\b):

var pattern = $@"\b(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})\b";
_regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);

Here, the longer keywords are placed before the shorter ones, so if you have foo, bar and foo bar, the pattern will look like \b(?:foo\ bar|foo|bar)\b and will match foo bar as a whole, rather than foo and bar separately, wherever such a match exists.
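As a quick check of the word-boundary approach against the input from the question, here is a minimal sketch. The class name WordBoundaryDemo and its methods are hypothetical wrappers added for illustration; the pattern itself is the one shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the \b(?:...)\b pattern from the answer.
public static class WordBoundaryDemo
{
    public static Regex Build(IEnumerable<string> keywords)
    {
        // Longest keywords first, each escaped, joined into one alternation.
        var alternation = string.Join("|",
            keywords.OrderByDescending(k => k.Length).Select(Regex.Escape));
        return new Regex($@"\b(?:{alternation})\b", RegexOptions.IgnoreCase);
    }

    public static List<string> Matches(string input, IEnumerable<string> keywords) =>
        Build(keywords).Matches(input).Cast<Match>().Select(m => m.Value).ToList();

    public static void Main()
    {
        var keywords = new[] { "foo", "bar", "baz", "boo", "tree", "car" };
        Console.WriteLine(string.Join(", ", Matches("#foo barbazboo tree car", keywords)));
        // prints: foo, tree, car

        // With a multi-word keyword, the longer alternative wins.
        var withPhrase = new[] { "foo", "bar", "foo bar" };
        Console.WriteLine(string.Join(", ", Matches("foo bar", withPhrase)));
        // prints: foo bar
    }
}
```

Inside barbazboo there is no word boundary between the keywords, so none of bar, baz or boo can match there.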

In case your keywords can look like keywords: {"$foo", "^bar^", "[baz]", "(boo)", "tree+", "+car"}, i.e. they can have special chars at the start/end of the keyword, you can use

_regex = new Regex($@"(?!\B\w)(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})(?<!\w\B)", RegexOptions.Compiled | RegexOptions.IgnoreCase);

The $@"(?!\B\w)(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})(?<!\w\B)" is an interpolated verbatim string literal that contains

  • (?!\B\w) - left-hand adaptive dynamic word boundary (a word boundary is required only if the keyword starts with a word char)
  • (?: - start of a non-capturing group:
    • {string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))} - sorts the keywords by length in descending order, escapes them and joins them with |
  • ) - end of the group
  • (?<!\w\B) - right-hand adaptive dynamic word boundary (a word boundary is required only if the keyword ends with a word char).
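To see how the adaptive boundaries behave with keywords that start or end with special chars, here is a small sketch. The class name AdaptiveBoundaryDemo and the sample input string are assumptions made for illustration; the pattern is the one described in the list above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the adaptive word boundary pattern.
public static class AdaptiveBoundaryDemo
{
    public static Regex Build(IEnumerable<string> keywords)
    {
        var alternation = string.Join("|",
            keywords.OrderByDescending(k => k.Length).Select(Regex.Escape));
        // (?!\B\w)  : if the match starts with a word char, require a word boundary before it
        // (?<!\w\B) : if the match ends with a word char, require a word boundary after it
        return new Regex($@"(?!\B\w)(?:{alternation})(?<!\w\B)", RegexOptions.IgnoreCase);
    }

    public static List<string> Matches(string input, IEnumerable<string> keywords) =>
        Build(keywords).Matches(input).Cast<Match>().Select(m => m.Value).ToList();

    public static void Main()
    {
        var keywords = new[] { "$foo", "^bar^", "[baz]", "(boo)", "tree+", "+car" };

        // "+cars" is rejected on the right (r is followed by s),
        // "xtree+" is rejected on the left (t is preceded by x).
        var found = Matches("$foo tree+ +cars +car xtree+", keywords);
        Console.WriteLine(string.Join(", ", found));
        // prints: $foo, tree+, +car
    }
}
```

Note that a side of the keyword that is a non-word char gets no boundary check at all, so with this pattern $foo would also be found inside x$foo.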
