
Using Condition in Regular Expressions

Source:

<TD>
    <A HREF="/home"><IMG SRC="/images/home.gif"></A>
    <IMG SRC="/images/spacer.gif">
    <A HREF="/search"><IMG SRC="/images/search.gif"></A>
    <IMG SRC="/images/spacer.gif">
    <A HREF="/help"><IMG SRC="/images/help.gif"></A>
</TD>

Regex:

  (<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>(?(1)\s*</[Aa]>)

Result:

<A HREF="/home"><IMG SRC="/images/home.gif"></A>
<IMG SRC="/images/spacer.gif">
<A HREF="/search"><IMG SRC="/images/search.gif"></A>
<IMG SRC="/images/spacer.gif">
<A HREF="/help"><IMG SRC="/images/help.gif"></A>

What does the "?(1)" mean?

When I run it in Java, it throws an exception, java.util.regex.PatternSyntaxException: the "?(1)" isn't recognized.

The explanation in the book is:

This pattern requires explanation. (<[Aa]\s+[^>]+>\s*)? matches an opening <A> or <a> tag (with any attributes that may be present), if present (the closing ? makes the expression optional). <[Ii][Mm][Gg]\s+[^>]+> then matches the <IMG> tag (regardless of case) with any of its attributes. (?(1)\s*</[Aa]>) starts off with a condition: ?(1) means execute only what comes next if backreference 1 (the opening <A> tag) exists (or in other words, execute only what comes next if the first <A> match was successful). If (1) exists, then \s*</[Aa]> matches any trailing whitespace followed by the closing </A> tag.

The syntax is correct. The strange-looking (?....) sets up a conditional. This is the regular expression syntax for an if...then statement. The (1) is a back-reference to the capture group at the beginning of the regex, which matches an HTML <a> tag if there is one (that capture group is optional). Since the back-reference to the captured tag sits in the "if" part of the regex, what it does is make sure that an opening <a> tag was captured before trying to match the closing one. It's a pretty clever way of making both tags optional, but requiring both when the first one exists. That's how it's able to match all the lines in the sample text even though some of them have only <img> tags.

As to why it throws an exception in your case, most likely the flavor of regex you're using doesn't support conditionals. Not all do.

EDIT: Here's a good reference on conditionals in regular expressions: http://www.regular-expressions.info/conditional.html
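To see the flavor difference concretely, here is a minimal sketch that asks java.util.regex whether it will compile a pattern at all. The class name ConditionalSupportCheck and the helper compiles are made up for this illustration; only Pattern.compile and PatternSyntaxException come from the JDK.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class ConditionalSupportCheck {
    // Hypothetical helper: true if java.util.regex accepts the pattern,
    // false if it rejects it with a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The pattern from the question, conditional included.
        String conditional =
            "(<[Aa]\\s+[^>]+>\\s*)?<[Ii][Mm][Gg]\\s+[^>]+>(?(1)\\s*</[Aa]>)";
        // The same pattern with the conditional removed.
        String plain = "(<[Aa]\\s+[^>]+>\\s*)?<[Ii][Mm][Gg]\\s+[^>]+>";

        System.out.println("conditional compiles: " + compiles(conditional));
        System.out.println("plain compiles: " + compiles(plain));
    }
}
```

On a standard JDK the first pattern should be rejected, which is the exception reported in the question, while the second compiles.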

What you're looking at is a conditional construct, as Bryan said, and Java doesn't support them. The parenthesized expression immediately after the question mark can actually be any zero-width assertion, like a lookahead or lookbehind, and not just a reference to a capture group. (I prefer to call those back-assertions, to avoid confusion. A back-reference matches the same thing the capture group did, but a back-assertion just asserts that the capture group matched something.)
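For contrast, here is a small sketch of a true back-reference, which Java does support: \1 has to match the same text that group 1 captured. The class name BackReferenceDemo, the QUOTED pattern and the quoted-string example are assumptions chosen for illustration and are not part of the original question.

```java
import java.util.regex.Pattern;

class BackReferenceDemo {
    // Group 1 captures the opening quote character; \1 then requires
    // the very same character again as the closing quote.
    static final Pattern QUOTED = Pattern.compile("([\"'])[^\"']*\\1");

    // Hypothetical helper: true if the whole string is a quoted value
    // whose closing quote is the same character as its opening quote.
    static boolean isQuoted(String s) {
        return QUOTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isQuoted("\"/home\"")); // matching double quotes
        System.out.println(isQuoted("'/home'"));   // matching single quotes
        System.out.println(isQuoted("\"/home'"));  // mismatched quotes
    }
}
```

The first two calls should report a match and the third should not. A back-assertion such as (?(1)...) would only ask whether group 1 took part in the match, without consuming any text.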

I learned about conditionals when I was working in Perl years ago, but I've never missed them in Java. In this case, for example, a simple alternation will do the trick:

(?i)<a\s+[^>]+>\s*<img\s+[^>]+>\s*</a>|<img\s+[^>]+>

One advantage of the conditional version is that you can capture the IMG tag with a single capture group:

(?i)(<a\s+[^>]+>\s*)?(<img\s+[^>]+>)(?(1)\s*</a>)

In the alternation version you have to have a capturing group for each alternative, but that's not as important in Java as it is in Perl, with all its built-in regex magic. Here's how I would pluck the IMG tags in Java:

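import java.util.regex.Matcher;
import java.util.regex.Pattern;

// s holds the HTML text to search.
Pattern p = Pattern.compile(
  "<a\\s+[^>]+>\\s*(<img\\s+[^>]+>)\\s*</a>|(<img\\s+[^>]+>)",
  Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(s);
while (m.find())
{
  // Group 1 is set for an IMG wrapped in a link, group 2 for a bare IMG.
  System.out.println(m.start(1) != -1 ? m.group(1) : m.group(2));
}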
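To try the snippet above against the sample HTML from the question, a self-contained version might look like the following sketch. The class name ImgTagExtractor, the method imgTags and the constant SAMPLE are made up for this illustration; the pattern is the one from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgTagExtractor {
    // Sample input copied from the question.
    static final String SAMPLE = String.join("\n",
        "<TD>",
        "    <A HREF=\"/home\"><IMG SRC=\"/images/home.gif\"></A>",
        "    <IMG SRC=\"/images/spacer.gif\">",
        "    <A HREF=\"/search\"><IMG SRC=\"/images/search.gif\"></A>",
        "    <IMG SRC=\"/images/spacer.gif\">",
        "    <A HREF=\"/help\"><IMG SRC=\"/images/help.gif\"></A>",
        "</TD>");

    // First alternative: IMG wrapped in a link (captured in group 1).
    // Second alternative: bare IMG (captured in group 2).
    static final Pattern IMG = Pattern.compile(
        "<a\\s+[^>]+>\\s*(<img\\s+[^>]+>)\\s*</a>|(<img\\s+[^>]+>)",
        Pattern.CASE_INSENSITIVE);

    // Collects every IMG tag, whichever alternative matched it.
    static List<String> imgTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        while (m.find()) {
            tags.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tags;
    }

    public static void main(String[] args) {
        imgTags(SAMPLE).forEach(System.out::println);
    }
}
```

Run on the sample, this should print the five IMG tags in document order, with the surrounding <A> and </A> tags left out.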

Could it be a non-capturing group, as described here:

There is also a special group, group 0, which always represents the entire expression. This group is not included in the total reported by groupCount. Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total. (You'll see examples of non-capturing groups later in the section Methods of the Pattern Class.)

Java Regex Tutorial
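The group-counting rule quoted above can be checked with a short sketch. The class name GroupCountDemo and the helper groups are hypothetical; Matcher.groupCount() is the JDK method the tutorial refers to.

```java
import java.util.regex.Pattern;

class GroupCountDemo {
    // Hypothetical helper: number of capturing groups in a pattern.
    // Group 0 (the whole match) is not included in this count.
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Two capturing groups.
        System.out.println(groups("(<a\\s+[^>]+>)?(<img\\s+[^>]+>)"));
        // (?: ... ) does not capture, so only one group is counted.
        System.out.println(groups("(?:<a\\s+[^>]+>)?(<img\\s+[^>]+>)"));
    }
}
```

The first pattern should report two groups and the second one. Note that (?:...) is a different construct from the (?(1)...) in the question, which Java does not parse at all.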

The short answer: it doesn't mean anything. The problem lies in this whole snippet:

(?(1)\s*)

() creates a back reference, so you can reuse any text matched inside. Parentheses also allow you to apply operators to everything inside them (but this isn't done in your example).

? means that the item before it should be matched if it's there, but it is also OK if it's not. This simply doesn't make sense when it appears after (

(?: MoreTextHere ) can be used to speed up regexes when you don't need to reuse the matched text. But that still doesn't really make sense: why match a 1 when your input is HTML?

Try:

(?:<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>

You never said exactly what you were trying to match, so if this answer doesn't satisfy you, please explain what you're trying to do with the regex.
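As a quick check of what the suggested pattern returns, here is a minimal sketch that applies it to two lines of the sample. The class name NonCapturingSuggestionDemo and the helper firstMatch are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingSuggestionDemo {
    // The pattern suggested above: optional opening <a> tag, then an <img> tag.
    static final Pattern P = Pattern.compile(
        "(?:<[Aa]\\s+[^>]+>\\s*)?<[Ii][Mm][Gg]\\s+[^>]+>");

    // Hypothetical helper: text of the first match, or null if there is none.
    static String firstMatch(String html) {
        Matcher m = P.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("<A HREF=\"/home\"><IMG SRC=\"/images/home.gif\"></A>"));
        System.out.println(firstMatch("<IMG SRC=\"/images/spacer.gif\">"));
    }
}
```

This compiles in Java, but on the linked image the match stops after the IMG tag and leaves out the closing </A>, so it is not a drop-in equivalent of the conditional pattern in the question.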
