
I have two problems, one of them is a regex

I am updating some code that I didn't write and part of it is a regex as follows:

\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]

I understand that .*? does a non-greedy match of everything in the second register.

What does ?:\s* in the first and third registers do?

Update: As requested, language is C# on .NET 3.5

The syntax (?:) is a way of putting parentheses around a subexpression without separately extracting that part of the string.

The author wanted to capture the (.*?) part in the middle, and didn't want the spaces at the beginning or the end to get in the way. Now you can use \1 or $1 (or whatever the appropriate method is in your particular language) to refer to the domain name, instead of to the first run of spaces inside the opening tag.

?: makes the parentheses non-capturing: they still group the subexpression, but they are not assigned a group number. In that regex, you'll only pull out one piece of information, $1, which contains the middle (.*?) expression.
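To see the effect on group numbering, here is a small sketch in Python (used only for illustration; the question is about C#, where (?:...) has the same meaning). The second pattern, with plain parentheses, is a hypothetical variant written only for comparison and does not appear in the original code.

```python
import re

# The pattern from the question: the whitespace groups are non-capturing.
original = re.compile(r"\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]")

# Hypothetical variant for comparison: the same pattern with plain
# (capturing) parentheses around the whitespace.
capturing = re.compile(r"\[url(\s*)\]www\.(.*?)\[/url(\s*)\]")

text = "[url  ]www.foo.com[/url ]"

m1 = original.search(text)
m2 = capturing.search(text)

# Number of capture groups, followed by what each group captured.
print(original.groups, m1.groups())
print(capturing.groups, m2.groups())
```

With the non-capturing form, the domain is group 1. With plain parentheses it would shift to group 2, and groups 1 and 3 would hold the whitespace. Since no quantifier follows either group, (?:\s*) in this pattern is equivalent to a bare \s*.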

What does ?:\s* in the first and third registers do?

It's matching zero or more whitespace characters, without capturing them.

The regex author intends to allow trailing whitespace in the square-bracket-tags, matching all DNS labels following the "www." like so:

[url]www.foo.com[/url]     # foo.com
[url  ]www.foo.com[/url  ] # same
[url  ]www.foo.com[/url]   # same
[url]www.foo.com[/url  ]   # same

Note that the regex also matches:

[url]www.[/url]      # empty string!

and fails to match

[url]stackoverflow.com[/url]  # no match, bummer
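The cases above can be checked with a short script. This is a sketch in Python rather than C#, assuming the pattern is used exactly as quoted in the question; the helper name captured is made up for the example.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]")

samples = [
    "[url]www.foo.com[/url]",
    "[url  ]www.foo.com[/url  ]",
    "[url]www.[/url]",
    "[url]stackoverflow.com[/url]",
]

def captured(s):
    """Return the text of group 1 if the pattern matches, else None."""
    m = pattern.search(s)
    return m.group(1) if m else None

results = [captured(s) for s in samples]

# Show each input next to what the middle group captured.
for s, r in zip(samples, results):
    print(repr(s), "->", repr(r))
```

The first two inputs yield foo.com, the third yields an empty string, and the last yields None because the literal www\. is required.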

You may find this Regular Expressions Cheat Sheet very helpful (hopefully). I spent ages trying to learn Regex with no luck. And once I read this cheat-sheet - I immediately understood what I previously failed to learn.

http://krijnhoetmer.nl/stuff/regex/cheat-sheet/
