Java regex to capture an HTML tag and its attributes
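(I know regular expressions are not the recommended way to process HTML, but this is the task I have been given.)
I need a regular expression in Java that captures an HTML tag and its attributes. I am trying to do this with a single regex, using groups. I expected this regex to work: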
<(?!!)(?!/)\s*(\w+)(?:\s*(\S+)=['"]{1}[^>]*?['"]{1})*\s*>
< the tag starts with <
(?!!) I don't want comments
(?!/) I don't want closing tags
\s* any amount of whitespace
(\w+) the tag name
(?: start a non-capturing group
\s* any amount of whitespace before the attribute
(\S+) capture the attribute's name
=['"]{1}[^>]*?['"]{1} the ="bottom" or ='bottom' part, etc.
)* close the non-capturing group; it can occur zero or more times
\s* any whitespace before the end of the tag
> close the tag
For a tag like the following, I expected this result:
<div id="qwerty" class='someClass' >
group(1) = "div"
group(2) = "id"
group(3) = "class"
But the actual result is:
group(1) = "div"
group(2) = "class"
It seems that a group cannot be captured multiple times with (...)*. Is that correct?
For now I am using a regex like:
<(?!!)(?!/)\s*(\w+) (?:\s*(\S+)=['"]{1}[^>]*?['"]{1}){0,1} (?:\s*(\S+)=['"]{1}[^>]*?['"]{1}){0,1} (...){0,1} (...){0,1} ... \s*>
I repeat the capturing group for the attribute several times, and get a result like this:
<div id="qwerty" class='someClass' >
group(1) = "div"
group(2) = "id"
group(3) = "class"
group(4) = null
group(5) = null
group(6) = null
...
What other approaches could I use? (I could use several regular expressions, but a single one would be preferred.)
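It seems that a single capturing group cannot hold multiple matches. So the result of
(..regex for group...)*
is still just one captured group, holding the last repetition that matched.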
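To make this concrete, here is a minimal sketch that runs the pattern from the question against the sample tag. The class name RepeatedGroupDemo and the helper match are hypothetical names chosen for illustration. The pattern contains only two capturing groups, however often the starred group repeats, and group(2) ends up with the last attribute name, which agrees with the result reported in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // The pattern from the question, unchanged.
    static final Pattern TAG = Pattern.compile(
        "<(?!!)(?!/)\\s*(\\w+)(?:\\s*(\\S+)=['\"]{1}[^>]*?['\"]{1})*\\s*>");

    // Hypothetical helper: returns {group(1), group(2)} of the first match,
    // or null when the input contains no matching tag.
    static String[] match(String html) {
        Matcher m = TAG.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] groups = match("<div id=\"qwerty\" class='someClass' >");
        // groupCount() is fixed by the pattern text, not by the input.
        System.out.println("groupCount = " + TAG.matcher("").groupCount());
        System.out.println("group(1) = " + groups[0]);
        // Each repetition overwrites group 2, so only the last one survives.
        System.out.println("group(2) = " + groups[1]);
    }
}
```

Because the number of groups is fixed when the pattern is compiled, collecting every attribute requires calling find() repeatedly rather than repeating a group.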
As a first step, here is code that captures the whole tag and then, in a second pass, captures all of its attributes:
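import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; it only makes the snippet compilable.
public class TagScanner {
    public static void main(String[] args) throws Exception {
        // Download the page into a single string.
        URL url = new URL("http://stackoverflow.com/");
        URLConnection connection = url.openConnection();
        StringBuilder stringBuilder = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
            String inputLine;
            while ((inputLine = reader.readLine()) != null) {
                // Join lines with a space so a tag split across lines stays parseable.
                stringBuilder.append(inputLine).append(' ');
            }
        }
        String pageContent = stringBuilder.toString();

        // Step 1: the tag name, then everything up to the closing '>'.
        Pattern pattern = Pattern.compile("<(?!!)(?!/)\\s*([a-zA-Z0-9]+)(.*?)>");
        // Step 2: each name="value" or name='value' pair inside that remainder.
        Pattern attributePattern = Pattern.compile("(\\S+)=['\"]{1}([^>]*?)['\"]{1}");

        Matcher matcher = pattern.matcher(pageContent);
        while (matcher.find()) {
            String tagName = matcher.group(1);
            String attributes = matcher.group(2);
            System.out.println("tag name: " + tagName);
            System.out.println(" rest of the tag: " + attributes);
            Matcher attributeMatcher = attributePattern.matcher(attributes);
            while (attributeMatcher.find()) {
                String attributeName = attributeMatcher.group(1);
                String attributeValue = attributeMatcher.group(2);
                System.out.println(" attribute name: " + attributeName + " value: " + attributeValue);
            }
        }
    }
}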
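For trying the two-step approach without a network connection, the following sketch applies it to an in-memory string and returns the attributes as a map. The names TagAttributes and attributesOfFirstTag are hypothetical. The attribute pattern here is a variation, not the one from the code above: it uses a backreference so the closing quote must be the same character as the opening one, and it allows whitespace around the equals sign.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagAttributes {
    // Step 1: tag name plus the raw remainder of the tag (same as above).
    private static final Pattern TAG =
        Pattern.compile("<(?!!)(?!/)\\s*([a-zA-Z0-9]+)(.*?)>");

    // Step 2 (assumed variation): group 1 is the name, group 2 the opening
    // quote, group 3 the value; \2 forces a matching closing quote.
    private static final Pattern ATTRIBUTE =
        Pattern.compile("([^\\s=]+)\\s*=\\s*(['\"])(.*?)\\2");

    // Returns the attributes of the first opening tag in insertion order,
    // or an empty map if the input contains no opening tag.
    static Map<String, String> attributesOfFirstTag(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = TAG.matcher(html);
        if (!tag.find()) {
            return result;
        }
        Matcher attribute = ATTRIBUTE.matcher(tag.group(2));
        while (attribute.find()) {
            result.put(attribute.group(1), attribute.group(3));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(
            attributesOfFirstTag("<div id=\"qwerty\" class='someClass' >"));
    }
}
```

Like the original, this sketch stops the tag at the first '>' it sees, so an attribute value that itself contains '>' will be cut short; an HTML parser is the safer choice when such input is possible.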