Is nesting regexes ever necessary?

I want to pull out the two numbers 10 and 11 from HTML that looks similar to this, only it has even more noise than what I show here:

<div a>
<noise=53>
<item=10>
<item=11>
</div>
<div b>
<item=20>
<noise=52>
<item=21>
</div>

I have figured out how to do it by using two regexes: first use

(?s)(?<=<div a>).*?(?=</div>)

to get the contents of the "div a" section, then use

(?s)(?<=<item=)[0-9]*

on the result to get the numbers I want. But I can't figure out how to do it in only one regex. I have a guess about how I could do it if only Java let me put * quantifiers in lookbehinds, but Java doesn't (and I vaguely understand why not). Is it possible to do this with only one regex, or should I settle for two?
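For reference, the two-step approach described above might be wired together in Java roughly as follows. This is only a sketch: the class name TwoStepExtract and the method extract are made up for illustration, and the second pattern uses + instead of * so that an item tag without digits is skipped rather than producing an empty match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not from the question.
class TwoStepExtract {
    // Step 1: isolate the body of the "div a" section ((?s) lets '.' span lines).
    private static final Pattern SECTION =
            Pattern.compile("(?s)(?<=<div a>).*?(?=</div>)");
    // Step 2: the digits that follow "<item=" inside that body.
    private static final Pattern ITEM = Pattern.compile("(?<=<item=)[0-9]+");

    static List<String> extract(String html) {
        List<String> numbers = new ArrayList<>();
        Matcher section = SECTION.matcher(html);
        if (section.find()) {
            // Run the second regex only on the text captured by the first.
            Matcher item = ITEM.matcher(section.group());
            while (item.find()) {
                numbers.add(item.group());
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        String html = "<div a>\n<noise=53>\n<item=10>\n<item=11>\n</div>\n"
                + "<div b>\n<item=20>\n<noise=52>\n<item=21>\n</div>";
        System.out.println(extract(html)); // prints [10, 11]
    }
}
```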

I don't think you can get down to one. But note that pulling apart HTML is best done with an XML or HTML parser. You can use an XML parser if the HTML is well-formed XHTML; otherwise look at http://java-source.net/open-source/html-parsers.

I'm not completely certain what you mean by nesting regexes. The usual approach to this sort of thing is to carefully pull off just a bit at a time, like a lexer. That way you don't have to try to build everything into one pattern.

Instead of using Matcher.matches(), you might go at it by using Matcher.lookingAt(), which tries to match starting at the beginning of the matcher's current region but, unlike matches(), doesn't require the whole region to match. That way you could test a bunch of patterns from the same position.
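A minimal sketch of that lexer-style loop, assuming input shaped like the question's sample. The class LookingAtLexer, its token patterns, and the helper at() are hypothetical names chosen for this illustration; region() is used to move the start point before each lookingAt() call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical lexer-style scanner; token patterns are assumptions for this sample format.
class LookingAtLexer {
    private static final Pattern OPEN_A = Pattern.compile("<div a>");
    private static final Pattern CLOSE = Pattern.compile("</div>");
    private static final Pattern ITEM = Pattern.compile("<item=(\\d+)>");
    private static final Pattern OTHER = Pattern.compile("<[^>]*>|[^<]+");

    // Try one pattern at exactly position pos; return the matcher on success, else null.
    private static Matcher at(Pattern p, String input, int pos) {
        Matcher m = p.matcher(input).region(pos, input.length());
        return m.lookingAt() ? m : null;
    }

    static List<String> itemsInDivA(String input) {
        List<String> items = new ArrayList<>();
        boolean inside = false; // true while we are between <div a> and </div>
        int pos = 0;
        while (pos < input.length()) {
            Matcher m;
            if ((m = at(OPEN_A, input, pos)) != null) {
                inside = true;
            } else if ((m = at(CLOSE, input, pos)) != null) {
                inside = false;
            } else if ((m = at(ITEM, input, pos)) != null) {
                if (inside) {
                    items.add(m.group(1));
                }
            } else if ((m = at(OTHER, input, pos)) == null) {
                pos++; // a stray '<' with no closing '>': skip one character
                continue;
            }
            pos = m.end(); // end() is an index into the whole input, not the region
        }
        return items;
    }

    public static void main(String[] args) {
        String s = "<div x><item=02></div><div a><noise=53><item=10><item=11></div><item=99>";
        System.out.println(itemsInDivA(s)); // prints [10, 11]
    }
}
```

Keeping the "am I inside div a" state in a plain boolean is the point of the design: the control flow lives in Java, and each regex only has to recognize one small token.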

A similar tactic involves using the one-argument form of Matcher.find(), where you supply a starting character position as the argument.
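As a rough illustration of find(int), the sketch below locates the opening tag, works out where the section ends, and then restarts the item search from an explicit offset each time. FindFromOffset and itemsInDivA are hypothetical names, and the approach assumes div sections are not nested. Note that find(int) resets the matcher before searching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example of Matcher.find(int); assumes non-nested div sections.
class FindFromOffset {
    static List<String> itemsInDivA(String input) {
        List<String> items = new ArrayList<>();
        Matcher open = Pattern.compile("<div a>").matcher(input);
        if (!open.find()) {
            return items; // no "div a" section at all
        }
        // The first </div> after the opening tag marks the end of the section.
        Matcher close = Pattern.compile("</div>").matcher(input);
        int limit = close.find(open.end()) ? close.start() : input.length();

        Matcher item = Pattern.compile("<item=(\\d+)>").matcher(input);
        int pos = open.end();
        // find(pos) resets the matcher and searches from index pos onward.
        while (item.find(pos) && item.start() < limit) {
            items.add(item.group(1));
            pos = item.end();
        }
        return items;
    }

    public static void main(String[] args) {
        String s = "<div a><noise=53><item=10><item=11></div><div b><item=20></div>";
        System.out.println(itemsInDivA(s)); // prints [10, 11]
    }
}
```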

A related feature is the \G anchor, a zero-width assertion that makes the search start up just where the last match on that same string left off. It saves you some bookkeeping that way.
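A small sketch of what \G buys you: the pattern below only matches item tags that sit back to back from the start of the input, because each find() must begin exactly where the previous match ended. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of the \G anchor.
class ResumeAnchorDemo {
    // \G pins each match to the end of the previous one (or the start of input).
    private static final Pattern CONTIGUOUS_ITEM = Pattern.compile("\\G<item=(\\d+)>");

    static List<String> leadingItems(String input) {
        List<String> items = new ArrayList<>();
        Matcher m = CONTIGUOUS_ITEM.matcher(input);
        while (m.find()) {
            items.add(m.group(1));
        }
        return items;
    }

    public static void main(String[] args) {
        // The noise tag breaks the chain, so the last item is never reached.
        System.out.println(leadingItems("<item=1><item=2><noise=3><item=4>")); // prints [1, 2]
    }
}
```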

By combining judicious uses of the find(N) and lookingAt() methods (plus start()), perhaps with the \G assertion, you can build yourself a more flexible and sophisticated processing algorithm than is practicable using a single regular expression alone.

It really is a lot easier to use structural logic, with regular Java managing your regexes for the pieces, than it is to try to do everything in one gargantuan regex. It's much easier to develop, debug, and unit-test that way, too. Regexes work best at dealing with pieces of strings, not when you try to encode an entire parsing algorithm in them.

Plus, in Java you can't really do that anyway, since there's no support for recursion within the pattern. Perhaps it's just as well, because it encourages you to put the control structures in the outer language, since you can't always put all of what you'd need in the inner one.

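import java.util.regex.*;

public class Test
{
  public static void main(String[] args)
  {
    // Sample input: only the items inside <div a> should be reported.
    String s = "<div x><item=02><noise=99><item=05></div>\n" + 
        "<div a><noise=53><item=10><item=11><noise=55><item=12></div>\n" + 
        "<item=99>\n" + 
        "<div b><item=20><noise=52><item=21></div>";
    System.out.println(s);
    System.out.println();
    // Start at <div a> (or resume at \G), skip anything that is not an
    // <item, <div or </div tag, then capture the next item number.
    Pattern p = Pattern.compile(
        "(?:<div a>|\\G)(?:[^<]++|<(?!(?:item|/?div)\\b))*+<item=(\\d+)");
    Matcher m = p.matcher(s);
    // Each find() picks up where the previous match ended and yields one number.
    while (m.find())
    {
      System.out.println(m.group(1));
    }
  }
}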

Output:

<div x><item=02><noise=99><item=05></div>
<div a><noise=53><item=10><item=11><noise=55><item=12></div>
<item=99>
<div b><item=20><noise=52><item=21></div>

10
11
12

Breaking that down, we have:

  • (?:<div a>|\\G) : \G matches wherever the previous match left off, or at the beginning of the text if there was no previous match. It's prevented from matching at the beginning by the lookahead in the next part, so the first match starts at the <div a>.

  • (?:[^<]++|<(?!(?:item|/?div)\\b))*+ : This part consumes whatever lies between the current match position and the next <item=N> tag. It gobbles up every character other than <, and also a < that isn't the beginning of an <item, <div, or </div sequence. (The latter two ensure that all <item=N> matches are contained within the same div element; additionally, <div is what prevents \G from matching at the beginning of the text, and </div prevents matches between div elements, like <item=99> in the example.)

  • Finally, <item=(\\d+) matches the item tag and captures the number you're after.
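The same breakdown can also be kept next to the pattern itself by compiling it with Pattern.COMMENTS, which ignores whitespace and treats # as the start of a comment. The following is only a sketch of that idea; the class name CommentedPattern is made up, and the literal space in <div a> is written as \x20 because unescaped whitespace is dropped in this mode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper; the pattern is the one from the answer, laid out with comments.
class CommentedPattern {
    static final Pattern ITEMS_IN_DIV_A = Pattern.compile(
            "(?: <div\\x20a> | \\G )             # start at the opening tag, or resume after the last match\n"
          + "(?: [^<]++                          # skip a run of text with no opening angle bracket\n"
          + "  | < (?! (?: item | /?div ) \\b )  # or a bracket that does not begin an item or div tag\n"
          + ")*+                                 # possessive, so the skipped text is never given back\n"
          + "<item= (\\d+)                       # the item tag; group 1 holds the number\n",
            Pattern.COMMENTS);

    static List<String> extract(String input) {
        List<String> numbers = new ArrayList<>();
        Matcher m = ITEMS_IN_DIV_A.matcher(input);
        while (m.find()) {
            numbers.add(m.group(1));
        }
        return numbers;
    }

    public static void main(String[] args) {
        String s = "<div x><item=02><noise=99><item=05></div>\n"
                + "<div a><noise=53><item=10><item=11><noise=55><item=12></div>\n"
                + "<item=99>\n"
                + "<div b><item=20><noise=52><item=21></div>";
        System.out.println(extract(s)); // prints [10, 11, 12]
    }
}
```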

I think the sed utility would be more useful than programming with regular expressions to extract the text data. Try the following sed script (run with the -n option).

/<div \w>/,/<\/div>/ {
    s/.*item=\([0-9]\+\).*/\1/p
}

If it's real HTML, you can convert it to XML, for example with HTMLTidy or NekoHTML, and then use an XPath expression on it.
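To give an idea of what the XPath step could look like, here is a sketch using the XML and XPath classes that ship with the JDK. The markup is an assumption: the sample from the question is not well-formed XML, so the example uses a hypothetical tidied form with id and value attributes. The class name XPathExtract is also made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example; assumes the HTML was already converted to well-formed XML.
class XPathExtract {
    static List<String> itemsInDivA(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Select the value attribute of every item directly under the div with id "a".
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//div[@id='a']/item/@value", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        // Assumed shape of the tidied document, not the original markup.
        String xml = "<root>"
                + "<div id=\"a\"><noise value=\"53\"/><item value=\"10\"/><item value=\"11\"/></div>"
                + "<div id=\"b\"><item value=\"20\"/><noise value=\"52\"/><item value=\"21\"/></div>"
                + "</root>";
        System.out.println(itemsInDivA(xml)); // prints [10, 11]
    }
}
```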

Don't even try; you need a parser, and plenty of them are available.
