
RegEx - match the whole <a> tag in Java

I'm trying to match <a href="**something**"> with a regex in Java, using this code:

Pattern regex = Pattern.compile("<([a-z]+) *[^/]*?>");
Matcher matcher = regex.matcher(string);
string = matcher.replaceAll("");

I'm not really familiar with regex. What am I doing wrong? Thanks

If you just want to find the start tag, you could use:

"<a(?=[>\\s])[^>]*>"

If you are trying to get the href attribute, it would be better to use:

"<a\\s+[^>]*href=(['\"])(.*?)\\1[^>]*>"

This captures the link in capturing group 2.
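Both patterns can be tried out with a small sketch like the one below. The class and method names (ATagPatterns, stripStartTags, extractHrefs) are made up for illustration; only the two regex strings come from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative assumptions.
final class ATagPatterns {
    // "<a" followed (via lookahead) by whitespace or ">", then everything up to ">".
    private static final Pattern A_START = Pattern.compile("<a(?=[>\\s])[^>]*>");

    // Group 1 is the opening quote, group 2 is the href value.
    private static final Pattern A_HREF =
            Pattern.compile("<a\\s+[^>]*href=(['\"])(.*?)\\1[^>]*>");

    // Removes every <a ...> start tag; closing </a> tags are left untouched.
    static String stripStartTags(String html) {
        return A_START.matcher(html).replaceAll("");
    }

    // Collects the quoted href value of every <a ...> start tag.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(2));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">x</a> <a class='c' href='/b'>y</a>";
        System.out.println(stripStartTags(html));
        System.out.println(extractHrefs(html));
    }
}
```

Note that the lookahead keeps the first pattern from matching tags such as <abbr>, and that the second pattern only sees quoted href values.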

To give you an idea of why people always say "don't try to parse HTML with a regular expression", here's a simplified regex for matching an <a> tag:

<\s*a(?:\s+[a-z]+(?:\s*=\s*(?:[a-z0-9]+|"[^"]*"|'[^']*'))?)*\s*>

It actually is possible to match a tag with a regular expression. It just isn't as easy as most people expect.

All of HTML, on the other hand, is not "regular", so you can't parse it with a regular expression. (The "regex" support in many/most languages is actually more powerful than "regular", but few are powerful enough to deal with balanced structures like those in HTML.)

Here's a breakdown of what the above expression does:

<\s*             < and possibly some spaces
a                "a"
(?:              0 or more...
  \s+              some spaces
  [a-z]+           attribute name (simplified)
  (?:              and maybe...
    \s*=\s*          an equal sign, possibly with surrounding spaces
    (?:              and one of:
      [a-z0-9]+        - a simple attribute value (simplified)
      |"[^"]*"         - a double-quoted attr value
      |'[^']*'         - a single-quoted attr value
    )
  )?
)*
\s*>             possibly more spaces and then >

(The comments at the start of each group also talk about the operator at the end of the group, or even in the group.)
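In Java, a commented layout like this can be kept in the source by compiling with Pattern.COMMENTS, which ignores whitespace in the pattern and treats # as the start of a comment. The sketch below is one way to do that; the class name VerboseATag is an assumption, and the pattern is the same simplified one, so it inherits the same limitations.

```java
import java.util.regex.Pattern;

// Hypothetical holder class for the simplified <a> tag pattern.
final class VerboseATag {
    static final Pattern A_TAG = Pattern.compile(
            "<\\s*a                # <, optional spaces, then the letter a\n"
          + "(?:                   # zero or more attributes:\n"
          + "  \\s+[a-z]+          #   spaces, then a (simplified) attribute name\n"
          + "  (?:                 #   optionally followed by...\n"
          + "    \\s*=\\s*         #     an equals sign with optional spaces\n"
          + "    (?:[a-z0-9]+      #     and an unquoted value,\n"
          + "      |\"[^\"]*\"     #     a double-quoted value,\n"
          + "      |'[^']*'        #     or a single-quoted value\n"
          + "    )\n"
          + "  )?\n"
          + ")*\n"
          + "\\s*>                 # optional spaces, then the closing bracket\n",
            Pattern.COMMENTS);

    // True only when the whole input is one <a ...> start tag.
    static boolean isATag(String input) {
        return A_TAG.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isATag("<a href=\"http://example.com/\">")); // true
        System.out.println(isATag("<abcdef>"));                         // false
        System.out.println(isATag("<a data-x=\"1\">"));                 // false
    }
}
```

The last call returns false because the simplified attribute name [a-z]+ does not allow a hyphen, which is exactly the kind of gap that makes this approach fragile.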

There are possibly other simplifications here; I wrote this from memory, not from the spec. Even if you follow the spec to the letter, browsers are even more fault tolerant and will accept all sorts of invalid input.

You can just match against:

"<a[^>]*>"

Since * is greedy by default in Java, this works. But you cannot match < a whatever="foo" > with it, because of the whitespace.

The following is better, though harder to understand:

"<\\s*a\\s+[^>]*>"

(The double \\ is needed because \ is the escape character in Java string literals.)

This handles optional whitespace before a and requires at least one whitespace character after a, so you don't match <abcdef>, which is not a valid a tag. (I assume your a tag stands alone on one line and you are not working with multiline mode enabled; otherwise it gets far more complicated.) The *[^/]*?> at the end of your pattern looks a little strange; maybe that is why it doesn't work.
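A quick way to compare the two patterns is to run them against a few sample tags. This is only a sketch: the class name, the constant names LOOSE and STRICTER, and the sample inputs are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical comparison harness for the two patterns in this answer.
final class SimpleATagPatterns {
    static final Pattern LOOSE = Pattern.compile("<a[^>]*>");
    static final Pattern STRICTER = Pattern.compile("<\\s*a\\s+[^>]*>");

    // True only when the pattern matches the entire input string.
    static boolean matchesWhole(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a href=\"x\">", "< a whatever=\"foo\" >", "<abcdef>", "<a>"
        };
        for (String s : samples) {
            System.out.println(s + "  loose=" + matchesWhole(LOOSE, s)
                    + "  stricter=" + matchesWhole(STRICTER, s));
        }
    }
}
```

One thing the comparison brings out: because \\s+ demands at least one whitespace character after a, the second pattern rejects a bare <a> as well as <abcdef>.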

OK, let's check what you are doing:

<([a-z]+) *[^/]*?>

<([a-z]+)

This matches a < followed by one or more characters from [a-z]. The parentheses make it a capturing group.

Next comes a space followed by a *. The * applies to the space, not to the group ([a-z]+), so this part matches zero or more spaces.

[^/]*

This matches any run of characters other than /, or nothing at all (because of the *).

The question mark after the * makes the quantifier lazy (reluctant): it matches as few characters as possible.

The final > is matched as the last element and must appear.

To sum up, your expression cannot match any tag that contains a /, which rules out most href values (for example anything starting with http://) :)
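To see how the pattern from the question behaves in practice, here is a small sketch (the class and method names are made up for illustration). It suggests the pattern does strip tags that contain no /, but leaves a tag alone as soon as its href contains one:

```java
import java.util.regex.Pattern;

// Hypothetical demo of the pattern from the question.
final class OriginalPatternDemo {
    static final Pattern ORIGINAL = Pattern.compile("<([a-z]+) *[^/]*?>");

    // Same operation as in the question: remove every match.
    static String strip(String html) {
        return ORIGINAL.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // No slash inside the tag: the start tag is removed.
        System.out.println(strip("<a href=\"page.html\">link</a>"));
        // Slash inside the href: [^/] cannot get past it, so nothing is removed.
        System.out.println(strip("<a href=\"http://example.com\">link</a>"));
    }
}
```

The closing </a> is never removed in either case, because [a-z]+ cannot match the / that follows <.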

Take a look at: http://www.regular-expressions.info/

This is a good starting point.
