I'm having a hard time understanding why the following expression \\[B.+\\] and code returns a Matches count of 1:
string sRegEx = "\\[B.+\\]";
return Regex.Matches(Markup, sRegEx);
I want to find all the instances (let's call them 'tags') that are prefixed by B and enclosed in square brackets, in a variable-length HTML string Markup that contains no line breaks.
If the markup contains [BName], I get one match - good.
If the markup contains [BName] [BAddress], I get one match - why?
If the markup contains [BName][BAddress], I also only get one match.
On some web-based regex testers, I've noticed that if the text contains a CR character, I'll get a match per line - but I need some way to specify that I want matches returned independent of line breaks.
I've also poked around in the Groups and Captures collections of the MatchCollection, but to no avail - always just one result.
You are getting only one match because, by default, quantifiers such as + in .NET regular expressions are "greedy"; they try to match as much as possible within a single match.
So if your value is [BName][BAddress] you will have one match, and it will cover the entire string: it runs from the [B at the beginning all the way to the last ] instead of stopping at the first one. If you want two matches, use this pattern instead: \\[B.+?\\]
The ? after the + makes the quantifier lazy: it tells the matching engine to match as little as possible, leaving the second tag to be its own match.
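To see the difference side by side, here is a minimal sketch that runs the greedy and the lazy pattern against the sample input from the question. The variable names are mine, and it is written with top-level statements (C# 9 or later); in an older project the body would go inside a Main method.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input taken from the question.
string markup = "[BName] [BAddress]";

// Greedy: .+ runs on to the last ] in the input, so both tags land in one match.
string[] greedy = Regex.Matches(markup, "\\[B.+\\]")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

// Lazy: .+? stops at the first ] that lets the whole pattern succeed.
string[] lazy = Regex.Matches(markup, "\\[B.+?\\]")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(greedy.Length);               // 1
Console.WriteLine(string.Join(" | ", greedy));  // [BName] [BAddress]
Console.WriteLine(lazy.Length);                 // 2
Console.WriteLine(string.Join(" | ", lazy));    // [BName] | [BAddress]
```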
Slaks also noted an excellent option: stating explicitly that the content between the brackets must not contain a ], like so: \\[B[^\\]]+\\]
That keeps your match 'greedy', which might be useful in some other case. In this specific instance, there may not be much difference - but it's an important thing to keep in mind depending on what data/patterns you might be dealing with specifically.
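As a rough illustration of where the two patterns can diverge, the sketch below compares them on the question's input and on a made-up input that starts with an empty tag [B]. The FindTags helper and the second input are my own assumptions, not something from the question; the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string lazyPattern = "\\[B.+?\\]";
string negatedPattern = "\\[B[^\\]]+\\]";

// Hypothetical helper: returns the matched text of every match.
string[] FindTags(string input, string pattern) =>
    Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

// Input from the question: both patterns find the same two tags.
string[] lazyTags = FindTags("[BName][BAddress]", lazyPattern);
string[] negatedTags = FindTags("[BName][BAddress]", negatedPattern);

// Made-up input with an empty tag in front of a real one.
// .+? needs at least one character, so it swallows the first ] and keeps going.
// [^\]]+ can never cross a ], so the attempt at [B] fails and only [BName] matches.
string[] lazyOdd = FindTags("[B][BName]", lazyPattern);
string[] negatedOdd = FindTags("[B][BName]", negatedPattern);

Console.WriteLine(string.Join(" | ", lazyTags));     // [BName] | [BAddress]
Console.WriteLine(string.Join(" | ", negatedTags));  // [BName] | [BAddress]
Console.WriteLine(string.Join(" | ", lazyOdd));      // [B][BName]
Console.WriteLine(string.Join(" | ", negatedOdd));   // [BName]
```

Which behaviour you want on input like that depends on your data; the negated character class is the stricter of the two.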
On a side note, I recommend using the C# verbatim string prefix @ for regular expression patterns, so that you do not need to double-escape backslashes; I would set the pattern like so:
string pattern = @"\[B.+?\]";
This makes more complex regular expressions much easier to read and write.
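For what it's worth, here is a small sketch showing that the escaped and the verbatim literal are the same string, plus a wrapper along the lines of the question's snippet. FindTags is a hypothetical name of mine, and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// The regular literal needs every backslash doubled; the verbatim one does not.
string escaped = "\\[B.+?\\]";
string verbatim = @"\[B.+?\]";

Console.WriteLine(escaped == verbatim);  // True

// Hypothetical wrapper mirroring the question's code, using the lazy pattern.
MatchCollection FindTags(string markup) => Regex.Matches(markup, verbatim);

int count = FindTags("[BName] [BAddress]").Count;
Console.WriteLine(count);  // 2
```

One thing to keep in mind: inside a verbatim string a double quote has to be written as "", since the backslash no longer acts as an escape character.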
Try the regex string \\[B.+?\\] instead. .+ on its own (the same is pretty much true for .*) will match against as many characters as possible, whereas .+? (or .*?) will match against the bare minimum number of characters whilst still satisfying the rest of the expression.
.+ is a greedy match; it will match as much as possible. In your second example, it matches BName] [BAddress.
You should write \[B[^\]]+\]. [^\]] matches every character except ], so it is forced to stop before the first ].