简体   繁体   中英

Using Condition in Regular Expressions

Source:

<TD>
    <A HREF="/home"><IMG SRC="/images/home.gif"></A>
    <IMG SRC="/images/spacer.gif">
    <A HREF="/search"><IMG SRC="/images/search.gif"></A>
    <IMG SRC="/images/spacer.gif">
    <A HREF="/help"><IMG SRC="/images/help.gif"></A>
</TD>

Regex:

  (<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>(?(1)\s*</[Aa]>)

Result:

<A HREF="/home"><IMG SRC="/images/home.gif"></A>
<IMG SRC="/images/spacer.gif">
<A HREF="/search"><IMG SRC="/images/search.gif"></A>
<IMG SRC="/images/spacer.gif">
<A HREF="/help"><IMG SRC="/images/help.gif"></A>

what's the "?(1)" mean?

When I run it in Java ,it cause a exception: java.util.regex.PatternSyntaxException,the "?(1)" can't be recognized.

The explanation in the book is :

This pattern requires explanation. (<[Aa]\\s+[^>]+>\\s*)? matches an opening <A> or <a> tag (with any attributes that may be present), if present (the closing ? makes the expression optional). <[Ii][Mm][Gg]\\s+[^>]+> then matches the <IMG> tag (regardless of case) with any of its attributes. (?(1)\\s*</[Aa]>) starts off with a condition: ?(1) means execute only what comes next if backreference 1 (the opening <A> tag) exists (or in other words, execute only what comes next if the first <A> match was successful). If (1) exists, then \\s*</[Aa]> matches any trailing whitespace followed by the closing </A> tag.

The syntax is correct. The strange looking (?....) sets up a conditional. This is the regular expression syntax for an if...then statement. The (1) is a back-reference to the capture group at the beginning of the regex, which matches an html <a> tag, if there is one since that capture group is optional. Since the back-reference to the captured tag follows the "if" part of the regex, what it is doing is making sure that there was an opening <a> tag captured before trying to match the closing one. A pretty clever way of making both tags optional, but forcing both when the first one exists. That's how it's able to match all the lines in the sample text even though some of them just have <img> tags.

As to why it throws an exception in your case, most likely the flavor of regex you're using doesn't support conditionals. Not all do.

EDIT: Here's a good reference on conditionals in regular expressions: http://www.regular-expressions.info/conditional.html

What you're looking at is a conditional construct, as Bryan said, and Java doesn't support them. The parenthesized expression immediately after the question mark can actually be any zero-width assertion, like a lookahead or lookbehind, and not just a reference to a capture group. (I prefer to call those back-assertions , to avoid confusion. A back-reference matches the same thing the capture group did, but a back-assertion just asserts that the capture group matched something .)

I learned about conditionals when I was working in Perl years ago, but I've never missed them in Java. In this case, for example, a simple alternation will do the trick:

(?i)<a\s+[^>]+>\s*<img\s+[^>]+>\s*</a]>|<img\s+[^>]+>

One advantage of the conditional version is that you can capture the IMG tag with a single capture group:

(?i)(<a\s+[^>]+>\s*)?(<img\s+[^>]+>)(?(1)\s*</a>)

In the alternation version you have to have a capturing group for each alternative, but that's not as important in Java as it is in Perl, with all its built-in regex magic. Here's how I would pluck the IMG tags in Java:

Pattern p = Pattern.compile(
  "<a\\s+[^>]+>\\s*(<img\\s+[^>]+>)\\s*</a>|(<img\\s+[^>]+>)"
  Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(s);
while (m.find())
{
  System.out.println(m.start(1) != -1 ? m.group(1) : m.group(2));
}

Could it be a non capturing group as described here:

There is also a special group, group 0, which always represents the entire expression. This group is not included in the total reported by groupCount. Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total. (You'll see examples of non-capturing groups later in the section Methods of the Pattern Class.)

Java Regex Tutorial

The short answer: it doesn't mean anything. The problem lies in this whole snippet:

(?(1)\s*)

() creates a back reference, so you can reuse any text matched inside. They also allow you to apply operators to everything inside of them (but this isn't done in your example).

? means that the item before it should be matched if it's there but it is also OK if it's not. This simply doesn't make sense when it appears after (

(?: MoreTextHere ) Can be used to speed up RegExs when you don't need to reuse the matched text. But that still doesn't really make sense, why match a 1 when your input is HTML?

Try:

(?:<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>

You never said exactly what you were trying to match so if this answer doesn't satisfy you, please explain what you're trying to do with RegEx.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM