Why doesn't my regular expression for HTML tag names work?

Question

I'm trying to get tag names from HTML strings using the following regex:

<(.*)(?:\s+(\S*)=.*)?>.*<\/\1?>

Here's some HTML I'm applying it to:

<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>
<div class="more-info"><a href="http://www.quackit.com/html/examples/html_links_examples.cfm">More Link Examples...</a></div>

As expected, I am getting p and div as matches. But for some reason it isn't detecting a. Why not?

Answer 1

Here is a regex that matches HTML tags, covering comments, scripting tags, closing tags, and opening tags with attributes:

<(?(?=!--)!--[\s\S]*--|(?(?=\?)\?[\s\S]*\?|(?(?=\/)\/[^.-\d][^\/\]'"[!#$%&()*+,;<=>?@^`{|}~ ]*|[^.-\d][^\/\]'"[!#$%&()*+,;<=>?@^`{|}~ ]*(?:\s[^.-\d][^\/\]'"[!#$%&()*+,;<=>?@^`{|}~ ]*(?:=(?:"[^"]*"|'[^']*'|[^'"<\s]*))?)*)\s?\/?))>

Explanation:

<                                                       # Tags always begin
  (?                                                    # What if...
    (?=!--)                                             # We have a comment?
      !--[\s\S]*--                                      # If so, anything goes between <!-- and -->.
      |                                                 # OR
      (?                                                # What if...
        (?=\?)                                          # We have a scripting tag?
          \?[\s\S]*\?                                   # If so, anything goes between <? and ?>.
          |                                             # OR
          (?                                            # What if...
            (?=\/)                                      # We have a closing tag?
              \/                                        # It should begin with a /.
              [^.-\d]                                   # Then the tag name, which can't begin with any of these characters.
              [^\/\]'"[!#$%&()*+,;<=>?@^`{|}~ ]*        # And can't contain any of these characters.
              |                                         # OR... we must have some other tag.
              [^.-\d]                                   # Tag names can't begin with these characters.
              [^\/\]'"[!#$%&()*+,;<=>?@^`{|}~ ]*        # And can't contain any of these characters.
                (?:                                     # Do we have any attributes?
                  \s                                    # If so, they'll begin with a space character.
                  [^.-\d]                               # Followed by a name that doesn't begin with any of these characters.
                  [^\/\]'"[!#$%&()*+,;<=>?@^`{|}~ ]*    # And doesn't contain any of these characters.
                    (?:                                 # Does our attribute have a value?
                      =                                 # If so, the value will begin with an = sign.
                      (?:                               # The value could be:
                      "[^"]*"                           # Wrapped in double quotes.
                      |                                 # OR
                      '[^']*'                           # Wrapped in single quotes.
                      |                                 # OR
                      [^'"<\s]*                         # Not wrapped in anything.
                      )                                 # That does it for our attribute value.
                    )?                                  # If the attribute is boolean it won't need a value.
                )*                                      # We could have any number of attributes.
          )                                             # That does it for our closing vs other tag check.
          \s?                                           # There could be one whitespace character before the closing >.
          \/?                                           # There might also be a / if this is a self-closing tag.
      )                                                 # That does it for our script vs html tag check.
  )                                                     # That does it for our comment vs script tag check.
>
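
The pattern above relies on conditional groups of the form (?(?=...)yes|no), which PCRE-style engines support but JavaScript's RegExp does not. Since each branch starts with a different literal (!--, ? or /), plain alternation can stand in for the conditionals. The following is a minimal sketch of that idea, not a port of the pattern above: the NAME, VALUE and ATTR pieces are simplified assumptions, comments and scripting tags are matched lazily so they stop at the first terminator, and the sample string is made up for illustration.

```javascript
// Simplified building blocks (assumptions, much looser than the original classes).
const NAME = "[A-Za-z_][\\w:.-]*";                    // tag or attribute name
const VALUE = `(?:"[^"]*"|'[^']*'|[^'"<>\\s]*)`;      // quoted or bare attribute value
const ATTR = `\\s+${NAME}(?:\\s*=\\s*${VALUE})?`;     // attribute, value optional

// Alternation replaces the nested conditionals of the PCRE-style pattern.
const TAG = new RegExp(
  "<(?:" +
    "!--[\\s\\S]*?--" +                      // comment: <!-- ... -->
    "|\\?[\\s\\S]*?\\?" +                    // scripting tag: <? ... ?>
    `|/(${NAME})\\s*` +                      // closing tag, name in group 1
    `|(${NAME})(?:${ATTR})*\\s*/?` +         // opening or self-closing tag, name in group 2
  ")>",
  "g"
);

// Hypothetical sample input.
const sample = `<!-- note --><p class="x"><br/><a href='u' disabled>text</a></p>`;

// Label each match by which capture group took part in it.
const tokens = Array.from(sample.matchAll(TAG), (m) =>
  m[1] ? `close:${m[1]}` : m[2] ? `open:${m[2]}` : "other"
);

console.log(tokens);
// [ 'other', 'open:p', 'open:br', 'open:a', 'close:a', 'close:p' ]
```

Because every tag is matched on its own rather than as an open/close pair, nested tags such as the a inside p are reported like any other.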

Answer 2

To answer the "why not?": the nested a tags are swallowed by .* (anything), so your regex can only match first-level tags. Each match spans from the opening tag to its closing tag, and the search resumes only after that span, so nothing inside it is examined again. What you need to do is match nested tags recursively. Annoyingly, JavaScript does not provide the PCRE recursion construct (?R), so dealing with nesting is far from easy. It can be done, however; check this article.
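
To make the explanation concrete, here is a small sketch that runs the pattern from the question against the HTML from the question. It assumes the pattern was used with the global flag. The second pattern, openingTag, is a hypothetical alternative that is not part of either answer: it matches opening tags one at a time instead of whole open/close pairs, so nesting stops mattering. It is an assumption-laden shortcut (for example, it breaks on attribute values that contain >), not a real HTML parser.

```javascript
const html = [
  '<p><a href="http://www.quackit.com/html/tutorial/html_links.cfm">Example Link</a></p>',
  '<div class="more-info"><a href="http://www.quackit.com/html/examples/html_links_examples.cfm">More Link Examples...</a></div>',
].join("\n");

// Pattern from the question, with the global flag so every match is listed.
// Each match covers a whole line, from <p> to </p> or from <div> to </div>.
const questionPattern = /<(.*)(?:\s+(\S*)=.*)?>.*<\/\1?>/g;
const outerOnly = Array.from(html.matchAll(questionPattern), (m) => m[1]);
console.log(outerOnly); // [ 'p', 'div' ]

// Hypothetical alternative: match each opening tag by itself and capture its name.
// Closing tags are skipped because "/" cannot start the captured name.
const openingTag = /<([A-Za-z][A-Za-z0-9-]*)[^>]*>/g;
const allNames = Array.from(html.matchAll(openingTag), (m) => m[1]);
console.log(allNames); // [ 'p', 'a', 'div', 'a' ]
```

The first result shows the behaviour described in the question: the inner a elements sit inside text that the outer match has already consumed.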

Answer 1: 2, 2017-07-03 09:57:53

Answer 2: 1, accepted, 2017-07-03 10:02:24
