Backreference before the capture group

Question

I'm trying to match the text "Page x of x" so I can identify the last page in a document.

I've been playing around with capture groups, and found that the regex Page (\d*) of \1 almost works, except that it also matches things such as Page 2 of 25. Ideally, I'd like to use Page \1 of (\d*), but I guess the regex engine doesn't support that.

I should also note that this is part of an OCR job, so I can't rely on string endings, since occasionally I pick up extra characters (Page 2 of 25la, for example).

Does anyone have any tips?

Answer 1

Use \d+ instead of \d*, so that at least one digit is required. Also use a lookahead to check that the number ends there.

Page (\d+) of \1(?=\D)
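
To see what this pattern accepts and rejects, here is a minimal Python sketch. The sample strings are made up for illustration and are not from the original post. Note the last sample: (?=\D) needs an actual non-digit character after the number, so a match at the very end of the input is rejected.

```python
import re

# Answer 1's pattern: \d+ requires at least one digit, and the lookahead
# (?=\D) requires a non-digit character right after the backreference.
pattern = re.compile(r"Page (\d+) of \1(?=\D)")

# Hypothetical OCR snippets, invented for this illustration.
samples = ["Page 2 of 2la", "Page 2 of 25", "Page 25 of 25 ", "Page 2 of 2"]

# Map each sample to whether the pattern is found anywhere in it.
results = {s: bool(pattern.search(s)) for s in samples}

for text, matched in results.items():
    print(repr(text), matched)
```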

Answer 2

Add a lookahead:

Page (\d*) of \1(?=\D|\Z)

The lookahead succeeds when the input following the backreference is a non-digit character or the end of input.
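
A small Python sketch of this behaviour follows, with invented sample strings. It also tries a negative lookahead, (?!\d), which is an assumed alternative spelling and not part of the answer as posted. It should accept the same positions, because "not followed by a digit" covers both a non-digit character and the end of input.

```python
import re

# Answer 2's pattern as posted: non-digit or end of input must follow.
posted = re.compile(r"Page (\d*) of \1(?=\D|\Z)")

# Assumed alternative (not from the answer): "not followed by a digit".
alternative = re.compile(r"Page (\d*) of \1(?!\d)")

# Hypothetical OCR snippets, invented for this illustration.
samples = ["Page 2 of 2", "Page 2 of 2la", "Page 2 of 25", "Page 25 of 25\n"]

posted_hits = [bool(posted.search(s)) for s in samples]
alternative_hits = [bool(alternative.search(s)) for s in samples]

for text, hit in zip(samples, posted_hits):
    print(repr(text), hit)
```

On these samples, only Page 2 of 25 is rejected by either pattern.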

Answer 3

But instead of an extra character such as "a" at the end, you could get an extra digit. Then you could be on the last page of your document, but the regex would not match.

Maybe the best way to attack this problem is to start with the simple regex

Page\s+(\d+)\s+of\s+(\d+)

and iterate over all occurrences to somehow overcome this nasty extra-character problem and get the maximum page number right. Once it is clear how many pages there are, check where group 1 equals group 2.

I included \s+ in my regex. Given your data, this should also be necessary.

But in the end, whether it works depends on the accuracy of the OCR processing.
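
One possible way to sketch this two-pass idea in Python is shown below. Several things here are assumptions and not part of the answer: the input is a list with one OCR string per page, the function name find_last_pages is hypothetical, and the page count is taken to be the total that appears most often, so that a single misread total is outvoted. The final check compares group 1 against that agreed total, not against the raw group 2, so that an extra digit on the last page does not hide it.

```python
import re
from collections import Counter

# The simple pattern from this answer: two independent numbers.
PAGE_RE = re.compile(r"Page\s+(\d+)\s+of\s+(\d+)")


def find_last_pages(pages):
    """Return (page_count, indexes of pages whose number equals page_count).

    `pages` is assumed to be a list of OCR text, one string per page.
    """
    hits = []
    for index, text in enumerate(pages):
        match = PAGE_RE.search(text)
        if match:
            hits.append((index, int(match.group(1)), int(match.group(2))))
    if not hits:
        return None, []

    # Assumption: the most frequent "of N" value is the real page count.
    page_count = Counter(total for _, _, total in hits).most_common(1)[0][0]

    # A page is "last" when its own number equals the agreed page count.
    last = [index for index, current, _ in hits if current == page_count]
    return page_count, last


# Hypothetical OCR output; the third page picked up an extra digit.
ocr_pages = ["Page 1 of 3", "Page 2 of 3la", "Page 3 of 31"]
page_count, last_pages = find_last_pages(ocr_pages)
print(page_count, last_pages)
```

If the OCR errors affect the current page number as well, or if most totals are misread, this vote will not help, which is the limitation the answer already points out.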

3 answers

Answer 1: score 2, accepted, 2014-03-10 16:26:38
Answer 2: score 1, 2014-03-10 16:26:23
Answer 3: score 1, 2014-03-10 16:52:31
