
Regex that I don't understand

I'm trying to understand this regex, can you help me out?

(?s)\\{\\{wotd\\|(.+?)\\|(.+?)\\|([^#\\|]+).*?\\}\\}
  • I don't really understand the meaning of DOTALL: (?s)
  • Why the double \\ before } ?
  • What exactly does (.+?) mean? Should we read it as: the . , then + acting on the . , then ? acting on the result of .+ ?

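This regex is taken from a string literal, in which each backslash has to be doubled. The "canonical" regex, which is what the regex engine actually sees, is: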

(?s)\{\{wotd\|(.+?)\|(.+?)\|([^#\|]+).*?\}\}
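To see the difference between the two forms, here is a minimal sketch that assumes the regex lives in Java source (the class name EscapeDemo is made up for illustration). Printing the string shows what the regex engine receives once the compiler has processed the doubled backslashes:

```java
public class EscapeDemo {
    // The regex as written in Java source: every backslash is doubled.
    static final String REGEX =
        "(?s)\\{\\{wotd\\|(.+?)\\|(.+?)\\|([^#\\|]+).*?\\}\\}";

    public static void main(String[] args) {
        // Prints the regex with single backslashes, i.e. the "canonical" form.
        System.out.println(REGEX);
    }
}
```

The printed text should be the canonical regex shown above, with one backslash where the source has two.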

The DOTALL modifier (?s) means that the dot can also match a newline character. Complemented character classes can match a newline regardless of this modifier, at least in Java: i.e. [^a] will match each and every character which is not a, newline included. Some regex engines do NOT match a newline in complemented character classes, though (this can be regarded as a bug).
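A small Java sketch of that behaviour (the class and helper names are made up for illustration): the dot only crosses a newline with (?s), while a complemented class crosses it either way.

```java
import java.util.regex.Pattern;

public class DotallDemo {
    // True when the regex matches the whole input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoLines = "a\nb";
        System.out.println(matchesWhole("a.b", twoLines));     // false: dot stops at newline
        System.out.println(matchesWhole("(?s)a.b", twoLines)); // true: DOTALL enabled
        System.out.println(matchesWhole("a[^x]b", twoLines));  // true: [^x] accepts newline
    }
}
```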

The +? and *? are lazy quantifiers (which should generally be avoided). A lazy quantifier matches as few characters as possible: at each position it first tries to match the next component of the regex, and only swallows one more character and tries again if that fails.
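As an illustration, here is a minimal Java sketch (names are hypothetical) comparing a greedy quantifier, a lazy one, and a complemented character class on the same pipe-separated input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyDemo {
    // Returns capture group 1 of the first match, or null if there is none.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "a|b|c";
        System.out.println(firstGroup("(.+)\\|", input));    // a|b  (greedy, then backtracks)
        System.out.println(firstGroup("(.+?)\\|", input));   // a    (lazy)
        System.out.println(firstGroup("([^|]+)\\|", input)); // a    (no lazy quantifier needed)
    }
}
```

The complemented class gives the same result as the lazy quantifier here because it simply cannot run past the delimiter.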

{ and } are preceded by \ (written \\ in the string literal) because an unescaped {...} is the repetition quantifier {n,m}, where n and m are integers; the backslash makes the braces literal.

Also, it is unnecessary to escape the pipe | in the character class [^#\|]; it can simply be written as [^#|].
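A quick way to check that both spellings of the class behave the same in Java (a sketch; the class and method names are made up):

```java
import java.util.regex.Pattern;

public class PipeClassDemo {
    // True when the single-character input is accepted by the class.
    static boolean accepts(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        for (String s : new String[] { "a", "|", "#" }) {
            // Escaped and unescaped pipe give the same answer for each input.
            System.out.println(s + ": " + accepts("[^#\\|]", s) + " " + accepts("[^#|]", s));
        }
    }
}
```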

And finally, .*? at the end seems to be there to swallow the rest of the fields. A better alternative is to use the normal* (special normal*)* pattern, where normal is [^|}] and special is \| .
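Here is a hedged Java sketch of that pattern on its own, applied to a made-up tail of fields (the input string and the names are assumptions for illustration). It consumes the remaining fields and stops right before the closing braces:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TailDemo {
    // normal* (special normal*)* with normal = [^|}] and special = \|
    static final Pattern TAIL = Pattern.compile("[^|}]*(?:\\|[^|}]*)*");

    // Returns the prefix of the input consumed by the pattern.
    static String consumed(String input) {
        Matcher m = TAIL.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(consumed("#anchor|extra|date}}")); // #anchor|extra|date
    }
}
```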

Here is the regex rewritten without lazy quantifiers, with the "fixed" character class and the modified end. Note that the DOTALL modifier has disappeared as well, since the dot isn't used anymore:

\{\{wotd\|([^|]+)\|([^|]+)\|([^#|]+)[^|}]*(?:\|[^|}]*)*\}\}

Step by step:

\{\{         # literal "{{", followed by
wotd         # literal "wotd", followed by
\|           # literal "|", followed by
([^|]+)      # one or more characters which are not a "|" (captured), followed by
\|           # literal "|", followed by
([^|]+)      # one or more characters which are not a "|" (captured), followed by
\|           # literal "|", followed by
([^#|]+)     # one or more characters which are not "|" or "#" (captured), followed by
[^|}]*       # zero or more characters which are not "|" or "}", followed by
(?:          # begin group
  \|         # a literal "|", followed by
  [^|}]*     # zero or more characters which are not "|" or "}"
)            # end group
*            # zero or more times, followed by
\}\}         # literal "}}"
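To try the rewritten regex, here is a minimal Java sketch. The sample template text, the class name and the parse helper are assumptions made up for illustration; only the regex itself comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WotdDemo {
    // The rewritten regex, as a Java string literal (backslashes doubled).
    static final Pattern WOTD = Pattern.compile(
        "\\{\\{wotd\\|([^|]+)\\|([^|]+)\\|([^#|]+)[^|}]*(?:\\|[^|}]*)*\\}\\}");

    // Returns the three captured fields, or null when no template is found.
    static String[] parse(String text) {
        Matcher m = WOTD.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Hypothetical input: three fields, then a "#..." suffix and extra fields.
        String sample = "{{wotd|foo|n|A definition#anchor|extra|date}}";
        System.out.println(String.join(" / ", parse(sample))); // foo / n / A definition
    }
}
```

The third group stops at the "#", and everything from there up to the closing braces is consumed without being captured.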
