
Perl regular expression explanation

I have a regular expression like this:

 s/<(?:[^>'"]|(['"]).?\1)*>//gs

and I don't know exactly what it means.

The regex looks intended to remove HTML tags from input.

It matches text that begins with < and ends with > . In between it allows any mix of single characters that are neither > nor a quote, and quoted strings (which may contain > ). But it appears to have an error:

The .? says that a quoted string may contain only 0 or 1 characters; it was probably intended to be .*? (0 or more characters, as few as possible). And to prevent backtracking from doing things like making the . match a quote in some odd cases, the (?: ... ) grouping should be changed to an atomic group, (?> ... ) (that is, > instead of : ), so that a quoted string cannot be re-matched once it has been accepted.
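To make the difference concrete, here is a minimal sketch that runs the original substitution and the suggested correction (with .*? and the (?> ... ) grouping) on the same input. The helper names strip_original and strip_fixed and the sample string are made up for this illustration; they are not from the question.

```perl
use strict;
use warnings;

# Original pattern from the question: .? allows at most one
# character between the opening and closing quote.
sub strip_original {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]|(['"]).?\1)*>//gs;
    return $html;
}

# Suggested correction: .*? for the quoted body, and (?>...) so that
# a quoted string, once matched, is not re-matched by backtracking.
sub strip_fixed {
    my ($html) = @_;
    $html =~ s/<(?>[^>'"]|(['"]).*?\1)*>//gs;
    return $html;
}

# A tag whose attribute value is longer than one character
# and contains a > inside the quotes.
my $input = q{<a title="x > y">link</a>};

print strip_original($input), "\n";   # prints: <a title="x > y">link
print strip_fixed($input),    "\n";   # prints: link
```

With the original pattern the opening tag survives, because "x > y" is longer than one character and so the quoted alternative can never match it. Only the closing tag, which has no quotes, is removed.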

This tool can explain the details: http://rick.measham.id.au/paste/explain.pl?regex=%3C%28%3F%3A[^%3E%27%22]|%28[%27%22]%29.%3F\\1%29*%3E

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  <                        '<'
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    [^>'"]                   any character except: '>', ''', '"'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      ['"]                     any character of: ''', '"'
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
    .?                       any character except \n (optional
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    \1                       what was matched by capture \1
--------------------------------------------------------------------------------
  )*                       end of grouping
--------------------------------------------------------------------------------
  >                        '>'

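So it tries to remove HTML tags, as ysth also mentions. Note that the tool was given the pattern without its modifiers: with /s, as in the original substitution, . also matches \n.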
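The same breakdown can be kept next to the pattern itself by using the /x modifier, which lets a regex carry whitespace and comments. The following is only a sketch: the pattern is the one from the question, unchanged (including the questionable .? ), while the variable $tag, the helper strip_tags and the sample input are hypothetical names chosen for the example.

```perl
use strict;
use warnings;

# The pattern from the question, laid out with /x.
# /s is kept, so the dot also matches a newline.
my $tag = qr{
    <               # a literal opening angle bracket
    (?:             # non-capturing group, repeated by the * below
        [^>'"]      # one character that is not a closing bracket or a quote
      |             # or
        (['"])      # an opening quote, captured as group 1
        .?          # at most one character of any kind
        \1          # the same quote character that group 1 captured
    )*              # zero or more repetitions of the group
    >               # a literal closing angle bracket
}xs;

# Remove everything the pattern matches from a string.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/$tag//g;
    return $text;
}

print strip_tags(q{<p class="a">Hi <b>there</b></p>}), "\n";   # prints: Hi there
```

The sample works only because the attribute value "a" is a single character; a longer value would run into the .? limitation described in the other answer.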
