
Run multiple RegEx patterns on a single string

I need to run a C# RegEx match on a string. The problem is that I'm looking for more than one pattern in a single string, and I cannot find a way to do that in a single run.

For example, in the string

The dog has jumped

I'm looking for "dog" and for "dog has".

I don't know how to get both results in one pass.

I've tried to combine the patterns with the alternation operator (|), like this:

(dog|dog has)

But it returned only the first match.

What can I use to get both matches back?

Thanks!

You can use one regex pattern to do both.

Pattern: (dog\b has\b)|(dog\b)

I worked out this pattern using an online regex builder.

Then you can use it in C# with the Regex class by doing something like

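// Requires: using System.Text.RegularExpressions;
string input = "The dog has jumped";
// Verbatim string (@"...") so that \b is a regex word boundary, not a backspace character
Regex reg = new Regex(@"(dog\b has\b)|(dog\b)", RegexOptions.IgnoreCase);
if (reg.IsMatch(input)) {
  // found "dog" or "dog has"
}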

The regex engine will return the first substring that satisfies the pattern. If you write (dog|dog has), it will never be able to match "dog has", because "dog has" starts with "dog", which is the first alternative. Furthermore, the regex engine won't return overlapping matches.

Here's a convoluted method:
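To make the effect of alternative order concrete, here is a minimal sketch. The class name `AlternationOrderDemo` and the helper `FirstMatch` are made up for illustration; only `Regex.Match` and `Regex.Matches` come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // Hypothetical helper: returns the text of the first match of
    // `pattern` in `input`, or null when there is no match.
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        const string input = "The dog has jumped";

        // Alternatives are tried left to right, so "dog" wins here.
        Console.WriteLine(FirstMatch(@"dog|dog has", input)); // dog

        // With the longer alternative first, "dog has" is found.
        Console.WriteLine(FirstMatch(@"dog has|dog", input)); // dog has

        // Either way only one match is reported at this position,
        // because matches never overlap.
        Console.WriteLine(Regex.Matches(input, @"dog has|dog").Count); // 1
    }
}
```

Reordering the alternatives changes which substring is reported, but it still cannot yield both "dog" and "dog has" from the same position.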

var patterns = new[] { "dog", "dog has" };

var sb = new StringBuilder();
for (var i = 0; i < patterns.Length; i++)
    sb.Append(@"(?=(?<p").Append(i).Append(">").Append(patterns[i]).Append("))?");

var regex = new Regex(sb.ToString(), RegexOptions.Compiled);
Console.WriteLine("Pattern: {0}", regex);

var input = "a dog has been seen with another dog";
Console.WriteLine("Input: {0}", input);

foreach (var match in regex.Matches(input).Cast<Match>())
{
    for (var i = 0; i < patterns.Length; i++)
    {
        var group = match.Groups["p" + i];
        if (!group.Success)
            continue;

        Console.WriteLine("Matched pattern #{0}: '{1}' at index {2}", i, group.Value, group.Index);
    }
}

This produces the following output:

Pattern: (?=(?<p0>dog))?(?=(?<p1>dog has))?
Input: a dog has been seen with another dog
Matched pattern #0: 'dog' at index 2
Matched pattern #1: 'dog has' at index 2
Matched pattern #0: 'dog' at index 33

Yes, this is an abuse of the regex engine :)

This works by building a pattern from optional lookaheads, which capture the substrings as a side effect; the pattern itself always matches an empty string. So there are n+1 matches in total, n being the input length. The patterns cannot contain numbered backreferences, but you can use named backreferences instead.

Also, this can return overlapping matches, as it will try to match all patterns at all string positions.

But you definitely should benchmark this against a manual approach (looping over the patterns and matching each of them separately). I don't expect this to be fast...
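The n+1 claim can be checked with a short sketch. It reuses the generated pattern and the sample input from above; the class name `LookaheadCountDemo` and the helper `Count` are invented for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaheadCountDemo
{
    // The pattern printed by the code above for { "dog", "dog has" }.
    public const string Pattern = @"(?=(?<p0>dog))?(?=(?<p1>dog has))?";

    // Hypothetical helper: returns the total number of matches and the
    // number of matches in which at least one named group captured text.
    public static (int Total, int WithCapture) Count(string input)
    {
        var matches = Regex.Matches(input, Pattern).Cast<Match>().ToList();
        int withCapture = matches.Count(
            m => m.Groups["p0"].Success || m.Groups["p1"].Success);
        return (matches.Count, withCapture);
    }

    public static void Main()
    {
        var input = "a dog has been seen with another dog";
        var (total, withCapture) = Count(input);

        // One empty match per position 0..input.Length.
        Console.WriteLine(total == input.Length + 1); // True

        // Only the positions where "dog" starts capture anything.
        Console.WriteLine(withCapture); // 2
    }
}
```

Most of the matches are empty and carry no captures, which is part of why this approach is wasteful on long inputs.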
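For comparison, the manual approach might look like the sketch below. It is one possible baseline, not a benchmarked implementation; the class name `MultiPatternDemo` and the helper `FindAll` are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MultiPatternDemo
{
    // Hypothetical helper: runs each pattern over the input separately
    // and records (pattern index, matched text, start index) per hit.
    public static List<(int Pattern, string Value, int Index)> FindAll(
        string input, string[] patterns)
    {
        var results = new List<(int Pattern, string Value, int Index)>();
        for (var i = 0; i < patterns.Length; i++)
        {
            foreach (Match m in Regex.Matches(input, patterns[i]))
                results.Add((i, m.Value, m.Index));
        }
        return results;
    }

    public static void Main()
    {
        var patterns = new[] { "dog", "dog has" };
        var input = "a dog has been seen with another dog";

        foreach (var (pattern, value, index) in FindAll(input, patterns))
            Console.WriteLine("Matched pattern #{0}: '{1}' at index {2}",
                              pattern, value, index);
    }
}
```

Results come back grouped by pattern rather than by position. Matches of different patterns may overlap each other, but matches of any single pattern still do not overlap themselves.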
