
Regex.Matches returns one match per line, not per “word”

I'm having a hard time understanding why the expression \\[B.+\\] and the following code return a Matches count of 1:

string sRegEx = "\\[B.+\\]";
return Regex.Matches(Markup, sRegEx);

I want to find all the instances (let's call them 'tags') that start with B and are enclosed in square brackets. They appear in Markup, a variable-length HTML string that contains no line breaks.

If the markup contains [BName] , I get one match - good.

If the markup contains [BName] [BAddress] , I get one match - why?

If the markup contains [BName][BAddress] , I also only get one match.

On some web-based regex testers, I've noticed that if the text contains a CR character, I'll get a match per line - but I need some way to specify that I want matches returned independent of line breaks.

I've also poked around in the Groups and Captures collections of the MatchCollection, but to no avail - always just one result.

You are getting only one match because .NET regex quantifiers such as + are "greedy" by default; they consume as much of the input as they can while still letting the overall pattern match.

So if your value is [BName][BAddress], you get one match that covers the entire string: it runs from the [B at the beginning all the way to the last ], rather than stopping at the first ]. If you want two matches, use this pattern instead: \\[B.+?\\]

The ? after the + makes the quantifier lazy: the engine matches as little as possible, so the first match ends at the first ] and the second tag becomes its own match.
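The difference between the greedy and lazy forms can be seen in a minimal sketch like the one below. The class name GreedyVsLazyDemo and the helper FindAll are made up for illustration; only Regex.Matches and the two patterns come from the discussion above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical helper: returns the text of every match of pattern in input.
    public static string[] FindAll(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        string markup = "[BName] [BAddress]";

        // Greedy: .+ runs on to the last ']' so both tags land in one match.
        string[] greedy = FindAll(markup, "\\[B.+\\]");
        Console.WriteLine(greedy.Length);              // 1
        Console.WriteLine(greedy[0]);                  // [BName] [BAddress]

        // Lazy: .+? stops at the first ']' so each tag is its own match.
        string[] lazy = FindAll(markup, "\\[B.+?\\]");
        Console.WriteLine(lazy.Length);                // 2
        Console.WriteLine(string.Join(" | ", lazy));   // [BName] | [BAddress]
    }
}
```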

SLaks also noted an excellent option: explicitly excluding the closing ] from the content, like so: \\[B[^\\]]+\\] That keeps the quantifier greedy, which might be useful in some other case. In this specific instance there may not be much difference, but it's worth keeping in mind depending on the data and patterns you are dealing with.
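One case where the two patterns do behave differently is input with a line break inside a tag. The question says the markup has none, so the input below is purely hypothetical, as are the names NegatedClassDemo and CountTags. It is a sketch of the difference, not a recommendation for either pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // Hypothetical helper: counts the matches of pattern in input.
    public static int CountTags(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // Assumed input: the first tag contains a '\n'.
        string markup = "[BNa\nme][BAddress]";

        // '.' does not match '\n' unless RegexOptions.Singleline is set,
        // so the lazy pattern only finds the second tag.
        Console.WriteLine(CountTags(markup, "\\[B.+?\\]"));       // 1

        // [^\]] matches any character other than ']', including '\n',
        // so both tags are found.
        Console.WriteLine(CountTags(markup, "\\[B[^\\]]+\\]"));   // 2
    }
}
```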


On a side note, I recommend using the C# verbatim string prefix @ for regular expression patterns, so that you do not need to escape every backslash a second time. I would set the pattern like so:

string pattern = @"\[B.+?\]";

This makes more complex regular expressions much easier to read.
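As a quick check that the two spellings really are the same pattern, here is a small sketch. The class and constant names are invented for the example; the pattern strings are the ones quoted above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimPatternDemo
{
    // Regular string literal: each regex backslash must be written twice.
    public const string Escaped = "\\[B.+?\\]";

    // Verbatim string literal: backslashes are taken as-is.
    public const string Verbatim = @"\[B.+?\]";

    public static void Main()
    {
        // Both literals produce the identical string, hence the same regex.
        Console.WriteLine(Escaped == Verbatim);                                  // True
        Console.WriteLine(Regex.Matches("[BName][BAddress]", Verbatim).Count);   // 2
    }
}
```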

Try the regex string \\[B.+?\\] instead. .+ on its own (the same is pretty much true for .* ) will match as many characters as possible, whereas .+? (or .*? ) will match the bare minimum number of characters while still satisfying the rest of the expression.

.+ is a greedy match; it will match as much as possible.
In your second example, the .+ matches Name] [BAddress .

You should write \[B[^\]]+\] .
[^\]] matches every character except ] , so it is forced to stop before the first ] .
