Java regex to capture HTML tags and their attributes

Question

(I know regex is not the recommended way to handle HTML, but this is my task.)

I need a regex in Java that captures an HTML tag and its attributes. I am trying to do this with a single regex, using groups. I expected this regex to work:

<(?!!)(?!/)\s*(\w+)(?:\s*(\S+)=['"]{1}[^>]*?['"]{1})*\s*>
<                                                            the tag starts with <
 (?!!)                                                       I don't want comments
      (?!/)                                                  I don't want closing tags
           \s*                                               any amount of whitespace
              (\w+)                                          the tag name
                   (?:                                       do not capture the following group
                      \s*                                    any amount of whitespace before each attribute
                         (\S+)                               capture the attribute's name
                              =['"]{1}[^>]*?['"]{1}          the ="bottom" or ='bottom' etc.
                                                   )*        close the non-capturing group; it can occur zero or more times
                                                     \s*     any whitespace before the closing of the tag
                                                        >    close the tag

For a tag like the following, I expected this result:

<div id="qwerty" class='someClass' >
group(1) = "div"
group(2) = "id"
group(3) = "class"

But the actual result is:

group(1) = "div"
group(2) = "class"

It seems that a group cannot capture multiple times when it is repeated with (...)*. Is that correct?

For now I am using a regex like:

<(?!!)(?!/)\s*(\w+) (?:\s*(\S+)=['"]{1}[^>]*?['"]{1}){0,1} (?:\s*(\S+)=['"]{1}[^>]*?['"]{1}){0,1} (...){0,1} (...){0,1} ... \s*>

I repeat the attribute capturing group several times and get a result like this:

<div id="qwerty" class='someClass' >
group(1) = "div"
group(2) = "id"
group(3) = "class" 
group(4) = null 
group(5) = null 
group(6) = null 
...

What other approaches could I use? (I could use several regular expressions, but a single one would be preferable.)

Answer 1

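It seems that one capturing group cannot hold more than one match. When a group is repeated, as in

(..regex for group...)*

it is still a single group, and it keeps only the text matched by its last iteration.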
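To make this concrete, here is a minimal sketch that runs the pattern from the question against the sample tag and prints what the groups contain. The class name `RepeatedGroupDemo` and the constant `TAG` are illustrative names, not part of the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; the pattern is the one from the question,
// unchanged apart from Java string escaping.
class RepeatedGroupDemo {
    static final Pattern TAG = Pattern.compile(
        "<(?!!)(?!/)\\s*(\\w+)(?:\\s*(\\S+)=['\"]{1}[^>]*?['\"]{1})*\\s*>");

    public static void main(String[] args) {
        Matcher m = TAG.matcher("<div id=\"qwerty\" class='someClass' >");
        if (m.find()) {
            // The pattern declares two capturing groups, however often the
            // quantified part repeats.
            System.out.println("groupCount = " + m.groupCount());
            System.out.println("group(1) = " + m.group(1));
            // Holds only the attribute name from the last repetition.
            System.out.println("group(2) = " + m.group(2));
        }
    }
}
```

The number of groups is fixed when the pattern is compiled, so `id` is overwritten by `class` as the quantified group repeats.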

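The code below works in two steps: it first captures the whole tag, then captures all the attributes from the captured text: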

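import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; it only makes the snippet compilable.
public class TagScanner {
    public static void main(String[] args) throws Exception {
        // Download the page into a single string.
        URL url = new URL("http://stackoverflow.com/");
        URLConnection connection = url.openConnection();
        StringBuilder stringBuilder = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
            String inputLine;
            while ((inputLine = reader.readLine()) != null) {
                // Keep a separator so that tokens on adjacent lines do not run together.
                stringBuilder.append(inputLine).append(' ');
            }
        }
        String pageContent = stringBuilder.toString();

        // Step 1: opening tags only (no comments, no closing tags).
        // group(1) is the tag name, group(2) is everything up to the closing '>'.
        Pattern pattern = Pattern.compile("<(?!!)(?!/)\\s*([a-zA-Z0-9]+)(.*?)>");
        // Step 2: name="value" or name='value' pairs; compiled once, outside the loop.
        Pattern attributePattern = Pattern.compile("(\\S+)=['\"]{1}([^>]*?)['\"]{1}");
        Matcher matcher = pattern.matcher(pageContent);
        while (matcher.find()) {
            String tagName = matcher.group(1);
            String attributes = matcher.group(2);
            System.out.println("tag name: " + tagName);
            System.out.println("     rest of the tag: " + attributes);
            Matcher attributeMatcher = attributePattern.matcher(attributes);
            while (attributeMatcher.find()) {
                String attributeName = attributeMatcher.group(1);
                String attributeValue = attributeMatcher.group(2);
                System.out.println("         attribute name: " + attributeName + "    value: " + attributeValue);
            }
        }
    }
}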
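For a quick check without a network connection, the same two patterns can be applied to the sample tag from the question. This is only a sketch: the class `TagAttributes` and the helper `firstTagAttributes` are hypothetical names introduced for illustration, and the helper looks at the first opening tag only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the two patterns used in the answer.
class TagAttributes {
    private static final Pattern TAG =
        Pattern.compile("<(?!!)(?!/)\\s*([a-zA-Z0-9]+)(.*?)>");
    private static final Pattern ATTRIBUTE =
        Pattern.compile("(\\S+)=['\"]{1}([^>]*?)['\"]{1}");

    // Returns attribute name -> value for the first opening tag in html,
    // in the order the attributes appear; empty if no tag is found.
    static Map<String, String> firstTagAttributes(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = TAG.matcher(html);
        if (tag.find()) {
            // Step 2 runs only on the text captured in step 1.
            Matcher attribute = ATTRIBUTE.matcher(tag.group(2));
            while (attribute.find()) {
                result.put(attribute.group(1), attribute.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {id=qwerty, class=someClass}
        System.out.println(firstTagAttributes("<div id=\"qwerty\" class='someClass' >"));
    }
}
```

Because the inner `find()` loop visits each attribute in turn, both names are recovered even though neither pattern has more than two groups. One limitation of the attribute pattern as written: the opening and closing quote are not required to be the same character, so a value such as "it's" would be cut off at the apostrophe.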

Solution 1
1 · accepted · 2013-11-02 00:56:32