
Regex - Pattern finds parts of itself with (.+)

In C#, I have the following Regex pattern (on an HTML string):

Regex TR = new Regex(@"<tr class=""(\w+)""  rel=""(\w+)"">(.+)</tr>");

The problem is that when I run it, the match extends all the way to the last </tr> in the HTML. There are many <tr> tags in the code, so (.+) swallows them and stops only at the last occurrence of </tr>.

I've tried using (\w+) instead, but it doesn't match certain characters inside the tags.

So how can I make this pattern stop at the first </tr> instead of running on to the last one in the code?

The following Regex pattern will stop at the first </tr> tag:

<tr(\s+)class(\s*)=(\s*)"[^"]*"(\s+)rel(\s*)=(\s*)"[^"]*"(\s*)>(.(?!<\/tr>))*[\s\S]<\/tr>

You can change your code to the following to get what you want:

Regex TR = new Regex(@"<tr class=""(\w+)""  rel=""(\w+)"">(.(?!<\/tr>))*[\s\S]</tr>");

(?!ABC) is called a negative lookahead. It asserts that ABC does not match at the current position, without consuming any characters; if ABC would match there, that attempt fails and the engine backtracks.
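
To see the difference in practice, here is a minimal sketch that runs the greedy pattern from the question and a lookahead-based variant over a made-up two-row snippet. The sample HTML and the variable names are assumptions for the demo. The variant is also a slight rearrangement of the pattern above, not the same expression: it checks the lookahead before consuming each character and wraps the whole loop in one capturing group, so group 3 holds the full row content. With (.(?!<\/tr>))* the group is overwritten on every iteration and ends up holding a single character.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input: two rows on one line, as described in the question.
string html = "<tr class=\"odd\" rel=\"r1\">first</tr><tr class=\"even\" rel=\"r2\">second</tr>";

// Greedy version from the question: (.+) runs on to the LAST </tr>.
var greedy = new Regex(@"<tr class=""(\w+)""\s+rel=""(\w+)"">(.+)</tr>");

// Lookahead version: consume a character only if </tr> does not start there.
// RegexOptions.Singleline lets . match newlines, so rows spanning lines also work.
var tempered = new Regex(
    @"<tr class=""(\w+)""\s+rel=""(\w+)"">((?:(?!</tr>).)*)</tr>",
    RegexOptions.Singleline);

MatchCollection greedyMatches = greedy.Matches(html);
MatchCollection temperedMatches = tempered.Matches(html);

Console.WriteLine(greedyMatches.Count);   // 1: both rows swallowed by a single match
Console.WriteLine(temperedMatches.Count); // 2: one match per row

foreach (Match m in temperedMatches)
{
    // Prints "odd / r1 / first" and then "even / r2 / second"
    Console.WriteLine($"{m.Groups[1].Value} / {m.Groups[2].Value} / {m.Groups[3].Value}");
}
```

A lazy quantifier, (.+?), is another common way to get the same "stop at the first </tr>" behaviour; it is mentioned here only as an alternative and is not part of the answer above.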

For future reference: Try using RegExr to create and test your regex patterns.

> So how can I make this pattern stop at the first </tr>

The most effective approach to capturing is not to consume blindly, but to consume only what is known.

Since the text to grab falls between the delimiters > and <, why not use the ending delimiter, the <, to give the regex parser a hint?

By putting the ^ character (which means "not" when it comes first inside a set) at the start of a set [ ], we effectively tell the parser to consume until one of the listed characters is hit.

In your case change

>(.+)</tr>

so that it uses [^<]+ , which says to consume one or more characters up to (but not including) the next < :

>([^<]+)</tr>

The [^ ] set is a powerful construct, and I use it in 90% of my regex patterns instead of blindly consuming with .+ or the even more side-effect-prone .* .
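
Here is a small sketch of the negated-set pattern applied to the pattern from the question. The sample input and variable names are made up for illustration, and the sketch assumes the row content is plain text. Because [^<]+ stops at any <, a row that contains nested markup such as <td> is not matched by this pattern, which the last line demonstrates.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: two rows whose content is plain text with no nested tags.
string rows = "<tr class=\"odd\"  rel=\"r1\">first row</tr>\n"
            + "<tr class=\"even\"  rel=\"r2\">second row</tr>";

// [^<]+ consumes one or more characters that are not '<',
// so it cannot run past the '<' that opens the first </tr>.
var rowPattern = new Regex(@"<tr class=""(\w+)""\s+rel=""(\w+)"">([^<]+)</tr>");

MatchCollection rowMatches = rowPattern.Matches(rows);
foreach (Match m in rowMatches)
{
    // Prints "first row" and then "second row"
    Console.WriteLine(m.Groups[3].Value);
}

// Limitation: nested markup makes the negated set stop early, so there is no match.
bool matchesNested = rowPattern.IsMatch("<tr class=\"odd\" rel=\"r1\"><td>cell</td></tr>");
Console.WriteLine(matchesNested); // False
```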


Also, to make your pattern easier to read, use \x22 in place of " so you are not fighting with the C# parser before the regex parser even sees the pattern.
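
As a quick check of that tip, the sketch below writes the same pattern two ways, once with doubled quotes and once with \x22, and confirms that both match the same text. The sample row and variable names are assumptions for the demo.

```csharp
using System;
using System.Text.RegularExpressions;

// Doubled quotes ("") are the C# escape for a literal quote inside a verbatim string.
var doubledQuotes = new Regex(@"<tr class=""(\w+)""\s+rel=""(\w+)"">([^<]+)</tr>");

// \x22 is the regex escape for the character with hex code 22, the double quote,
// so the C# string itself contains no quote characters to escape.
var hexQuotes = new Regex(@"<tr class=\x22(\w+)\x22\s+rel=\x22(\w+)\x22>([^<]+)</tr>");

// Hypothetical sample row.
string sample = "<tr class=\"odd\" rel=\"r1\">first row</tr>";

Match a = doubledQuotes.Match(sample);
Match b = hexQuotes.Match(sample);

Console.WriteLine(a.Success && b.Success); // both patterns match
Console.WriteLine(a.Value == b.Value);     // and they match the same text
```

Which form reads better is a matter of taste; the two are equivalent to the regex engine.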


 