
How to use C# Regular Expression Lookbehind with line anchors

I am having trouble with lookbehind assertions in C# regular expressions when also using line-begin and line-end anchors. In the sample below, RegEx B behaves exactly as I expect (and as documented here: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference).

I was initially surprised that RegEx A did not match Line 1. Now I think I understand why: because the assertion is zero-width, the expression is basically ^\d{2}$, which clearly doesn't match a 4-digit year and could only match two-digit inputs like lines 6 & 7.

I know I can rewrite the positive assertion (RegEx A) like this: ^19\d{2}$.

But my ultimate goal is a regular expression like RegEx C: one that uses a negative assertion to find all the strings that don't start with a given prefix. That is, I am trying to create an expression with a negative assertion that returns true for lines 3 and 4 and false for lines 5-7.

RegEx D is a similar negative-assertion sample from the C# documentation, but it doesn't use begin/end anchors. It is true for lines 3 and 4, but also for lines 5-7.

With that in mind, how can I make a negative assertion (like RegEx C) work with line-begin/-end anchors so that it functions like the example from RegEx D while validating that the input is a single line?

I'm wondering if this is simply not possible using assertions. That would mean the alternative is to express all the positive cases that evaluate to the negation of the exception (similar to using 19 in Regex E), but it's either impossible or impractical to express a large set of positives when the goal is to exclude a particular single (perhaps-complex) case.

Thanks!

Sample Program:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Text.RegularExpressions;

namespace RegExTest
{
    class Program
    {
        static void Main(string[] args)
        {
            string[] reList = new string[]
            {
                @"^(?<=19)\d{2}$",   // RegEx A
                @"(?<=19)\d{2}",     // RegEx B
                @"^(?<!19)\d{2}$",   // RegEx C
                @"(?<!19)\d{2}\b",   // RegEx D
                @"^19\d{2}$",        // RegEx E
            };

            string[] tests = new string[]
            {
                "1999",                     // Line 1
                "1851 1999 1950 1905 2003", // Line 2
                "1895",                     // Line 3
                "2095",                     // Line 4
                "195",                      // Line 5
                "18",                       // Line 6
                "19",                       // Line 7
            };
            foreach (var r in reList)
            {
                var re = new Regex(r);
                Console.WriteLine("");
                Console.WriteLine($"{r}");
                Console.WriteLine("==========================");
                foreach (var s in tests)
                {
                    Console.WriteLine($"{s}={re.IsMatch(s)}");
                    if (re.IsMatch(s))
                    {
                        foreach (Match m in re.Matches(s))
                        {
                            Console.WriteLine($"Match @ ({m.Index}, {m.Length})");
                        }
                    }
                }
            }
        }
    }
}

Output:

^(?<=19)\d{2}$
==========================
1999=False
1851 1999 1950 1905 2003=False
1895=False
2095=False
195=False
18=False
19=False

(?<=19)\d{2}
==========================
1999=True
Match @ (2, 2)
1851 1999 1950 1905 2003=True
Match @ (7, 2)
Match @ (12, 2)
Match @ (17, 2)
1895=False
2095=False
195=False
18=False
19=False

^(?<!19)\d{2}$
==========================
1999=False
1851 1999 1950 1905 2003=False
1895=False
2095=False
195=False
18=True
Match @ (0, 2)
19=True
Match @ (0, 2)

(?<!19)\d{2}\b
==========================
1999=False
1851 1999 1950 1905 2003=True
Match @ (2, 2)
Match @ (22, 2)
1895=True
Match @ (2, 2)
2095=True
Match @ (2, 2)
195=True
Match @ (1, 2)
18=True
Match @ (0, 2)
19=True
Match @ (0, 2)

^19\d{2}$
==========================
1999=True
Match @ (0, 4)
1851 1999 1950 1905 2003=False
1895=False
2095=False
195=False
18=False
19=False

You are confusing lookaround assertions with the default behavior of a normal pattern. A lookaround only asserts a condition; it doesn't consume characters.

It checks a condition at the current position. If the condition is satisfied, the cursor stays where the check began; otherwise the engine backtracks or fails.
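As a small illustration of "asserts but doesn't consume", the sketch below compares a lookbehind with an ordinary pattern on the first test string. The class and variable names are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class LookbehindDoesNotConsume
{
    static void Main()
    {
        // Lookbehind: "19" must precede the match but is not part of it.
        Match asserted = Regex.Match("1999", @"(?<=19)\d{2}");
        Console.WriteLine($"{asserted.Value} @ {asserted.Index}"); // 99 @ 2

        // Ordinary pattern: "19" is consumed and becomes part of the match.
        Match consumed = Regex.Match("1999", @"19\d{2}");
        Console.WriteLine($"{consumed.Value} @ {consumed.Index}"); // 1999 @ 0
    }
}
```

The first match starts at index 2, which agrees with the "Match @ (2, 2)" line in the RegEx B output for Line 1.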

RegEx C ^(?<!19)\d{2}$ does not match string 1 (1999) because the engine works this way:

  1. ^ Assert beginning of string (we are at position 0)
  2. (?<!19) Check that the preceding characters are not 19 (at position 0 there are no preceding characters, so this is satisfied)
  3. \d{2} Consume two digits (we are now at position 2)
  4. $ Assert end of string (there are still 2 more characters before the end of the string, so the match fails)
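The steps above can be checked piece by piece. The following is a minimal sketch that feeds growing prefixes of the anchored negative-lookbehind pattern to Regex; the class name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class AnchoredLookbehindTrace
{
    static void Main()
    {
        const string input = "1999";

        // Steps 1-2: at position 0 nothing precedes the cursor, so (?<!19) succeeds.
        Console.WriteLine(Regex.IsMatch(input, @"^(?<!19)"));        // True

        // Step 3: two digits are consumed, leaving the cursor at position 2.
        Match m = Regex.Match(input, @"^(?<!19)\d{2}");
        Console.WriteLine($"{m.Value} @ ({m.Index}, {m.Length})");  // 19 @ (0, 2)

        // Step 4: $ cannot match at position 2 of a four-character string.
        Console.WriteLine(Regex.IsMatch(input, @"^(?<!19)\d{2}$"));  // False

        // A two-digit input passes every step, even "19" itself.
        Console.WriteLine(Regex.IsMatch("19", @"^(?<!19)\d{2}$"));   // True
    }
}
```

The last line shows why lines 6 and 7 match in the question's output: directly after ^ the lookbehind has nothing to inspect, so it can never reject anything.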

So you have to use either ^\d{2}(?<!19)\d{2}$ or ^(?!19)\d{4}$; the second is more suitable.
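To see how the two suggested patterns behave, here is a short sketch that runs both against the seven test strings from the question. The class and variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

class NegativePrefixCheck
{
    static void Main()
    {
        string[] tests =
        {
            "1999", "1851 1999 1950 1905 2003", "1895", "2095", "195", "18", "19",
        };

        // Lookbehind placed after the first two digits, where it has something to inspect.
        var lookbehind = new Regex(@"^\d{2}(?<!19)\d{2}$");
        // Lookahead placed at the start, inspecting the two characters that follow.
        var lookahead = new Regex(@"^(?!19)\d{4}$");

        foreach (string s in tests)
        {
            Console.WriteLine($"{s}: {lookbehind.IsMatch(s)} / {lookahead.IsMatch(s)}");
        }
    }
}
```

Only 1895 and 2095 (lines 3 and 4) report True for both patterns; every other line reports False. The lookahead form keeps the prefix check next to the anchor, which is probably why it reads as the more natural of the two.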
