Is nesting regexes ever necessary?

I want to pull out the two numbers 10 and 11 from HTML that looks similar to this, only it has even more noise than what I show here:

<div a>
<noise=53>
<item=10>
<item=11>
</div>
<div b>
<item=20>
<noise=52>
<item=21>
</div>

I have figured out how to do it by using two regexes: first use

(?s)(?<=<div a>).*?(?=</div>)

to get stuff in the "div a" section, then use

(?s)(?<=<item=)[0-9]*

on the result to get the numbers I want. But I can't figure out how to do it in only one regex. I have a guess about how I could if only Java let me put *s in lookbehinds, but Java doesn't (and I vaguely understand why not). Is it possible to do this with only one regex or should I settle for two?
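For reference, the two-step approach described above could be wired together in Java roughly as follows. This is only a sketch: the class and method names are made up for illustration, and the second pattern uses `+` instead of `*` so that an empty value is never reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepExtract {
    // Step 1: isolate the text between "<div a>" and the next "</div>".
    private static final Pattern SECTION =
        Pattern.compile("(?s)(?<=<div a>).*?(?=</div>)");

    // Step 2: grab the digits that follow each "<item=".
    private static final Pattern ITEM =
        Pattern.compile("(?<=<item=)[0-9]+");

    static List<String> extract(String html) {
        List<String> numbers = new ArrayList<>();
        Matcher section = SECTION.matcher(html);
        if (section.find()) {
            // Run the second regex only on what the first one matched.
            Matcher item = ITEM.matcher(section.group());
            while (item.find()) {
                numbers.add(item.group());
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        String html = "<div a>\n<noise=53>\n<item=10>\n<item=11>\n</div>\n"
                    + "<div b>\n<item=20>\n<noise=52>\n<item=21>\n</div>\n";
        System.out.println(extract(html)); // prints [10, 11]
    }
}
```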

I don't think you can get down to one. But note that pulling apart HTML is best done with an XML or HTML parser. You can use an XML parser if the HTML is well-formed XHTML; otherwise look at http://java-source.net/open-source/html-parsers .

I'm not completely certain what you mean by nesting regexes. The way this sort of thing is usually approached is to carefully pull off just a bit at a time, like a lexer. That way you don't have to try to build everything into one pattern.

Instead of using Matcher.matches() , you might go at it by using Matcher.lookingAt() , which tries to match starting at the beginning of the matcher's region without requiring the whole region to match. That way you can test a bunch of patterns from the same position.

A similar tactic involves using the one-argument form of Matcher.find() , where you supply a starting character position as the argument.
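A small sketch of both tactics, assuming tags shaped like the ones in the question. The class and helper names are hypothetical; `find(int)`, `region(int, int)` and `lookingAt()` are the standard `java.util.regex.Matcher` methods.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindFromDemo {
    private static final Pattern ITEM = Pattern.compile("<item=(\\d+)>");

    // Returns the first item number found at or after 'offset', or null.
    static String firstItemFrom(String text, int offset) {
        Matcher m = ITEM.matcher(text);
        // find(int) resets the matcher and searches forward from the index.
        return m.find(offset) ? m.group(1) : null;
    }

    // Returns true only if an item tag begins exactly at 'offset'.
    static boolean itemStartsAt(String text, int offset) {
        Matcher m = ITEM.matcher(text);
        // region() moves the start point; lookingAt() must match right there.
        m.region(offset, text.length());
        return m.lookingAt();
    }

    public static void main(String[] args) {
        String text = "<item=10><noise=5><item=11>";
        System.out.println(firstItemFrom(text, 0));  // prints 10
        System.out.println(firstItemFrom(text, 1));  // prints 11
        System.out.println(itemStartsAt(text, 9));   // prints false
        System.out.println(itemStartsAt(text, 18));  // prints true
    }
}
```

The difference is that `find(int)` is free to skip ahead, while `lookingAt()` only succeeds if the pattern matches at the start of the region.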

A related feature is the \G anchor, a zero-width assertion that makes the next match start exactly where the previous match on that same string left off. It saves you some bookkeeping that way.
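To see the anchor in isolation, here is a minimal sketch (hypothetical class and method names) that collects only the run of item tags at the very start of a string. Because each match must begin where the last one ended, the loop stops at the first tag that is not an item. In a Java string literal the anchor is written `"\\G"`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GAnchorDemo {
    // \G pins every match to the end of the previous one
    // (or to the start of the input for the first match).
    private static final Pattern LEADING_ITEM =
        Pattern.compile("\\G<item=(\\d+)>");

    static List<String> leadingItems(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = LEADING_ITEM.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // The <noise=3> tag breaks the chain, so <item=4> is never reached.
        System.out.println(leadingItems("<item=1><item=2><noise=3><item=4>"));
        // prints [1, 2]
    }
}
```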

By combining judicious uses of the find(N) and lookingAt() methods (plus start() ), perhaps with the \G assertion, you can build yourself a more flexible and sophisticated processing algorithm than is practicable using a single regular expression alone.

It really is a lot easier to use structural logic with regular Java managing your regexes for the pieces than it is to try to do everything in one gargantuan regex. It's much easier to develop, debug, and unit-test that way, too. Regexes work best at dealing with pieces of strings, not trying to encode an entire parsing algorithm in them.
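As a rough illustration of that division of labor, the sketch below uses one small regex to pull off a tag at a time and ordinary Java to track whether we are inside the wanted div. The names are invented for the example, and it assumes divs are not nested and that tags look like the ones in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ItemLexer {
    // One token = one tag; group 1 is whatever sits between < and >.
    private static final Pattern TAG = Pattern.compile("<([^<>]*)>");
    private static final Pattern ITEM = Pattern.compile("item=(\\d+)");

    static List<String> itemsIn(String html, String divName) {
        List<String> out = new ArrayList<>();
        boolean inside = false; // state lives in Java, not in the regex
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            String body = tag.group(1);
            if (body.equals("div " + divName)) {
                inside = true;
            } else if (body.equals("/div")) {
                inside = false;
            } else if (inside) {
                Matcher item = ITEM.matcher(body);
                if (item.matches()) {
                    out.add(item.group(1));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "<div x><item=02><noise=99><item=05></div>\n"
                 + "<div a><noise=53><item=10><item=11><noise=55><item=12></div>\n"
                 + "<item=99>\n"
                 + "<div b><item=20><noise=52><item=21></div>";
        System.out.println(itemsIn(s, "a")); // prints [10, 11, 12]
    }
}
```

Each piece can be unit-tested on its own, and changing which div is wanted is a parameter rather than a pattern rewrite.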

Plus in Java you can't really do that anyway, since there's no support for recursion within the pattern. Perhaps it's just as well, because it encourages you to put the control structures in the outer language, since you can't always put all of what you'd need in the inner one.

import java.util.regex.*;

public class Test
{
  public static void main(String[] args)
  {
    String s = "<div x><item=02><noise=99><item=05></div>\n" + 
        "<div a><noise=53><item=10><item=11><noise=55><item=12></div>\n" + 
        "<item=99>\n" + 
        "<div b><item=20><noise=52><item=21></div>";
    System.out.println(s);
    System.out.println();
    Pattern p = Pattern.compile(
        "(?:<div a>|\\G)(?:[^<]++|<(?!(?:item|/?div)\\b))*+<item=(\\d+)");
    Matcher m = p.matcher(s);
    while (m.find())
    {
      System.out.println(m.group(1));
    }
  }
}

output:

<div x><item=02><noise=99><item=05></div>
<div a><noise=53><item=10><item=11><noise=55><item=12></div>
<item=99>
<div b><item=20><noise=52><item=21></div>

10
11
12

Breaking that down, we have:

  • (?:<div a>|\\G) : \G matches wherever the previous match left off, or at the beginning of the text if there was no previous match. It's prevented from matching at the beginning by the lookahead in the next part, so the first match starts at the <div a> .

  • (?:[^<]++|<(?!(?:item|/?div)\\b))*+ : This part consumes whatever lies between the current match position and the next <item=N> tag. It gobbles up all characters except < , and < if it's not the beginning of a <item , <div , or </div sequence. (The latter two ensure that all <item=N> matches are contained within the same div element; additionally, <div is what prevents \G from matching at the beginning of the text, and </div prevents matches between div elements, like <item=99> in the example.)

  • Finally, <item=(\\d+) matches the item tag and captures the number you're after.

I think the sed utility would be more useful than programming with regular expressions for extracting the text data. Try the following sed script (run with the -n option; note that \w and \+ are GNU sed extensions).

/<div \w>/,/<\/div>/ {
    s/.*item=\([0-9]\+\).*/\1/p
}

If it is real HTML, you can convert it to XML, for example with HTML Tidy or NekoHTML, and then use XPath expressions on it.
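A minimal sketch of the XPath step, using only the JDK's built-in `javax.xml` APIs. It assumes the cleanup step has already produced well-formed XML, and the document shape (`div` elements with an `id` attribute, `item` elements with a `value` attribute) is purely hypothetical, since the markup in the question is not valid XML as written. The class and method names are also invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathItems {
    // Assumed input shape: <root><div id="a"><item value="10"/>...</div></root>
    static List<String> itemValues(String xml, String divId) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // divId is assumed to be a trusted, simple identifier.
            String expr = "//div[@id='" + divId + "']/item/@value";
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getNodeValue());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>"
            + "<div id=\"a\"><noise value=\"53\"/><item value=\"10\"/><item value=\"11\"/></div>"
            + "<div id=\"b\"><item value=\"20\"/></div>"
            + "</root>";
        System.out.println(itemValues(xml, "a")); // prints [10, 11]
    }
}
```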

Don't even try; you need a parser, and many are available.


 