How to find nested tags in LaTeX with a regex

I'm trying to extract theorems from LaTeX source with Java. My code almost works, but one test case is failing: nested theorems.

@Test
public void testNestedTheorems() {
    String source = "\\begin{theorem}" +
                    "this is the outer theorem" +
                    "\\begin{theorem}" +
                    "this is the inner theorem" +
                    "\\end{theorem}" +
                    "\\end{theorem}";

    LatexTheoremProofExtractor extractor = new LatexTheoremProofExtractor(source);
    extractor.parse();

    ArrayList<String> theorems = extractor.getTheorems();
    assertNotNull(theorems);
    assertEquals(2, theorems.size()); // theorems.size() is 1
    assertEquals("this is the outer theorem", theorems.get(0)); 
    assertEquals("this is the inner theorem", theorems.get(1)); 
}

Here's my theorem extractor, which is called by LatexTheoremProofExtractor#parse:

private void extractTheorems() {

    // If this has been called before, return
    if(theorems != null) {
        return;
    }

    theorems = new ArrayList<String>();

    final Matcher matcher = THEOREM_REGEX.matcher(source);

    // Add trimmed matches while you can find them
    while (matcher.find()) {
        theorems.add(matcher.group(1).trim());
    }   
}

and THEOREM_REGEX is defined as follows:

private static final Pattern THEOREM_REGEX = Pattern.compile(Pattern.quote("\\begin{theorem}")
                                                    + "(.+?)" + Pattern.quote("\\end{theorem}"));

Does anyone have recommendations for dealing with the nested tags?

If you only want to match doubly nested theorems, you can write down an explicit regular expression for it. I guess it would look something like this.

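Pattern.compile(
      Pattern.quote("\\begin{theorem}")
        + "(.*?)"                                 // outer text before an inner theorem
        + "(?:"
            + Pattern.quote("\\begin{theorem}")
                + "(.*?)"                         // text of the inner theorem
            + Pattern.quote("\\end{theorem}")
            + "(.*?)"                             // outer text after the inner theorem
        + ")*"                                    // zero or more inner theorems
     + Pattern.quote("\\end{theorem}"));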

(This code should give you the idea, but it is untested and probably does not work as written. That is not the point I want to make.)

You can continue this for triple nesting and so forth, for any fixed nesting depth you want. Needless to say, it becomes rather awkward pretty soon.
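The fixed-depth pattern does not have to be written out by hand; it can be generated. The following is only a sketch under some assumptions: the class name `NestedTheoremRegex` and the method `theoremRegex` are made up for this example, the pattern is meant to be compiled with `Pattern.DOTALL` so that theorems may span lines, and it has only been reasoned through on small inputs like the one in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedTheoremRegex {
    private static final String BEGIN = Pattern.quote("\\begin{theorem}");
    private static final String END = Pattern.quote("\\end{theorem}");

    // Hypothetical helper: builds a regex for a theorem that may contain
    // up to maxDepth further levels of nested theorems.
    public static String theoremRegex(int maxDepth) {
        if (maxDepth == 0) {
            // No nesting allowed: stop at the first \end{theorem}.
            return BEGIN + ".*?" + END;
        }
        // Text, then any number of (shallower theorem, text) pairs.
        String inner = theoremRegex(maxDepth - 1);
        return BEGIN + ".*?(?:" + inner + ".*?)*" + END;
    }

    public static void main(String[] args) {
        String source = "\\begin{theorem}outer "
                + "\\begin{theorem}inner\\end{theorem}\\end{theorem}";
        for (int depth = 0; depth <= 1; depth++) {
            Matcher m = Pattern.compile(theoremRegex(depth), Pattern.DOTALL)
                    .matcher(source);
            System.out.println(depth + ": " + (m.find() ? m.group() : "no match"));
        }
    }
}
```

With depth 0 the match stops at the first \end{theorem}, which is exactly the failure in the question; with depth 1 it spans the whole outer theorem. Input nested deeper than the chosen depth is still paired up wrongly, and patterns like this can backtrack heavily on large or unbalanced input.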

However, if your goal is to match arbitrarily deep nesting, then it is simply impossible with regular expressions. The problem is that the language you want to match is not regular (it is context-free). The class of context-free languages is strictly larger than the class of regular languages, and regular expressions in the formal sense describe exactly the regular languages. (Some regex flavors, such as PCRE, add recursion and can go beyond regular languages, but java.util.regex has no such feature.) In general, you will need to construct a push-down automaton if you want to match a context-free language. In practice that means scanning the source while keeping a stack of the theorem environments that are currently open.
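A push-down automaton sounds heavier than it is. Here is a minimal sketch of the idea, not a drop-in replacement for the extractor in the question: the class `NestedTheoremScanner` and its `extractTheorems(String)` method are invented for this example. It assumes that every \begin{theorem} and \end{theorem} in the source is a real delimiter (no LaTeX comments, no verbatim blocks), and that an outer theorem's text should exclude the text of the theorems nested inside it, as the test in the question expects.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedTheoremScanner {
    // The regex only finds the delimiters; the stacks track the nesting.
    private static final Pattern DELIMITER = Pattern.compile(
            Pattern.quote("\\begin{theorem}") + "|" + Pattern.quote("\\end{theorem}"));

    /** Returns the text of each theorem, in the order the theorems open. */
    public static List<String> extractTheorems(String source) {
        List<String> theorems = new ArrayList<>();
        Deque<Integer> openSlots = new ArrayDeque<>();       // result index per open theorem
        Deque<StringBuilder> openText = new ArrayDeque<>();  // collected text per open theorem
        Matcher matcher = DELIMITER.matcher(source);
        int position = 0;

        while (matcher.find()) {
            // Text since the previous delimiter belongs to the innermost open theorem.
            if (!openText.isEmpty()) {
                openText.peek().append(source, position, matcher.start());
            }
            position = matcher.end();

            if (matcher.group().startsWith("\\begin")) {
                openSlots.push(theorems.size());
                theorems.add(null); // reserve a slot so outer theorems come first
                openText.push(new StringBuilder());
            } else if (!openText.isEmpty()) {
                theorems.set(openSlots.pop(), openText.pop().toString().trim());
            } else {
                throw new IllegalArgumentException(
                        "Unmatched \\end{theorem} at offset " + matcher.start());
            }
        }
        if (!openText.isEmpty()) {
            throw new IllegalArgumentException("Unclosed \\begin{theorem}");
        }
        return theorems;
    }

    public static void main(String[] args) {
        String source = "\\begin{theorem}this is the outer theorem"
                + "\\begin{theorem}this is the inner theorem\\end{theorem}\\end{theorem}";
        // prints [this is the outer theorem, this is the inner theorem]
        System.out.println(extractTheorems(source));
    }
}
```

The regular expression is still useful here, but only as a tokenizer. The slot reserved on each \begin{theorem} is what keeps the outer theorem at index 0 even though the inner one closes first.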
