
Get first occurrence of match in Regex

I have the following text:

"cat dog mouse lion"

And I search for "dog" or "mouse" using regex:

Regex regex = new Regex(@"dog|mouse");

The way Regex in C# behaves is that it first searches all the way through for the word dog. If it finds a match, it stops. How do I make it stop after finding the first occurrence of any of my words in the regex, meaning stop after "cat" as this occurs first?

Do I have to make multiple regex searches and match the indexes of the findings? Or is it possible to specify it in the regex expression?

No, you are wrong.

Regex regex = new Regex(@"dog|mouse");

and

Regex regex = new Regex(@"mouse|dog");

will both find the word "dog", even though in the second case "mouse" comes first in the alternation.

The matching behaviour is different from what you described. At each position in the text, the regex tries the first alternative; if that does not match, it does not move on to the next character but tries the second alternative at the same position. Only when every alternative has failed does it advance to the next character.
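A minimal sketch of this behaviour, using the sample text from the question (the variable names are only illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

string text = "cat dog mouse lion";

// Same words, opposite order inside the alternation.
Match dogFirst = Regex.Match(text, @"dog|mouse");
Match mouseFirst = Regex.Match(text, @"mouse|dog");

// Both report the leftmost match in the input, which is "dog" at index 4.
Console.WriteLine($"{dogFirst.Value} at {dogFirst.Index}");     // dog at 4
Console.WriteLine($"{mouseFirst.Value} at {mouseFirst.Index}"); // dog at 4
```

So a single call to Match already returns whichever word appears first in the text; no index comparison across several searches is needed.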

But the ordering of the alternation is important in another respect. You will run into problems when you have alternatives that start the same way and you order them from short to long, e.g.

Regex regex = new Regex(@"Foo|Foobar");

This will never match the word "Foobar": even when the text contains "Foobar", the first alternative "Foo" matches and the regex stops there.

To avoid this problem, order the alternatives from long to short:

Regex regex = new Regex(@"Foobar|Foo");

On the text "Foo", this first tries to match "Foobar"; when it finds that no "b" follows, it tries the second alternative and successfully matches "Foo".
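The effect of the ordering can be checked with a short sketch like the following (the input strings are made up for the illustration):

```csharp
using System;
using System.Text.RegularExpressions;

string text = "Foobar";

// Short alternative first: "Foo" succeeds at index 0, so "Foobar" is never tried.
string shortFirst = Regex.Match(text, @"Foo|Foobar").Value;

// Long alternative first: "Foobar" is tried first and matches the whole word.
string longFirst = Regex.Match(text, @"Foobar|Foo").Value;

// On input that only contains "Foo", the long-first pattern falls back
// to its second alternative.
string fallback = Regex.Match("Foo fighters", @"Foobar|Foo").Value;

Console.WriteLine(shortFirst); // Foo
Console.WriteLine(longFirst);  // Foobar
Console.WriteLine(fallback);   // Foo
```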

One way to do that is to use a lazy quantifier with the dotall option (RegexOptions.Singleline in .NET):

Regex regex = new Regex(@"^.*?\b(?>dog|mouse)\b", RegexOptions.Singleline);
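Because this pattern is anchored at the start, the match itself also contains everything before the word. The sketch below shows what the dotall option changes and one possible way to pull out just the word. The named group "word" and the multi-line input are assumptions added for the example; they replace the atomic group, which makes no difference for two fixed literals.

```csharp
using System;
using System.Text.RegularExpressions;

// Input with a line break before the first target word.
string text = "cat\ndog mouse lion";

// Without Singleline, '.' stops at '\n', so the anchored pattern cannot reach "dog".
bool withoutOption = Regex.IsMatch(text, @"^.*?\b(?>dog|mouse)\b");

// With Singleline, '.' also matches '\n'. The capturing group "word"
// isolates the word from the prefix consumed by the lazy quantifier.
Match m = Regex.Match(text, @"^.*?\b(?<word>dog|mouse)\b", RegexOptions.Singleline);

Console.WriteLine(withoutOption);          // False
Console.WriteLine(m.Groups["word"].Value); // dog
Console.WriteLine(m.Groups["word"].Index); // 4
```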

Another way is this:

Regex regex = new Regex(@"^(?>[^dm]+|(?!\b(?:dog|mouse)\b)[dm])*\b(?>dog|mouse)\b");

It is longer but more efficient. The idea is to avoid the lazy quantifier, which is slow because it has to test at every character whether the rest of the pattern matches. Here I describe the beginning of the string as "anything that is not a d or an m, OR a d or an m that does not start the whole word dog or mouse", repeated zero or more times.

(?>..) is an atomic group: once it has matched, the regex engine does not backtrack into it. It is a kind of 'all or nothing'.

In other regex flavours a possessive quantifier such as ++ avoids backtracking too, but .NET does not support possessive quantifiers, so the atomic group does that job here.
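As a hedged sketch, here is a variant of the same idea in a form that .NET accepts. It uses atomic groups only, because .NET rejects possessive quantifiers such as ++ as nested quantifiers. The negative lookahead, the named group "word" and the sample strings are assumptions made for this illustration, not the answerer's exact pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Skip loop: each iteration is atomic and consumes either a run of characters
// other than d/m, or a single d/m that does not begin a whole word "dog"/"mouse".
var regex = new Regex(@"^(?>[^dm]+|(?!\b(?:dog|mouse)\b)[dm])*\b(?<word>dog|mouse)\b");

Match first = regex.Match("cat dog mouse lion");

// "hotdog" and "madam" contain d and m but are not the target words.
Match embedded = regex.Match("hotdog madam mouse dog");

Console.WriteLine(first.Groups["word"].Value);    // dog
Console.WriteLine(embedded.Groups["word"].Value); // mouse
Console.WriteLine(embedded.Groups["word"].Index); // 13
```

Checking the word boundary inside the lookahead keeps words that merely end in "dog" (such as "hotdog") from stopping the skip loop early.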
