
Backreference before capture group

I'm trying to match the text Page x of x so I can identify the last page in a document.

I've been playing around with capture groups, and found that the regex Page (\d*) of \1 almost works, except that it also matches things such as Page 2 of 25. Ideally, I'd like to use Page \1 of (\d*), but I guess the regex engine doesn't support that.
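To make the problem concrete, here is a minimal sketch in Python (an assumption, since the question does not say which regex engine is in use). It shows the false positive on Page 2 of 25, and that Python's re module rejects a backreference to a group that has not been defined yet. Other engines may handle such forward references differently.

```python
import re

# The original attempt: nothing constrains what follows the backreference.
pattern = re.compile(r"Page (\d*) of \1")

for text in ["Page 25 of 25", "Page 2 of 25", "Page 2 of 3"]:
    # "Page 2 of 25" matches because \1 ("2") matches the first digit of "25".
    print(text, "->", bool(pattern.search(text)))

# Referring to group 1 before it is defined fails at compile time in Python.
try:
    re.compile(r"Page \1 of (\d*)")
    forward_reference_ok = True
except re.error as exc:
    forward_reference_ok = False
    print("forward reference rejected:", exc)
```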

I should also note that this is part of an OCR job, so I can't rely on string endings, since occasionally I pick up extra characters (Page 2 of 25la, for example).

Does anyone have any tips?

Use \d+ instead of \d*. Also use a lookahead to check that the number ends there.

Page (\d+) of \1(?=\D)

Add a lookahead:

Page (\d*) of \1(?=\D|\Z)

The lookahead succeeds when the input following the backreference is a non-digit character or the end of input.
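A short sketch of the lookahead in Python (the language is an assumption; \Z is not available in every engine, JavaScript for one). It uses \d+ rather than \d* so that an empty page number cannot match. In Python, (?!\d) would be an equivalent way to write (?=\D|\Z).

```python
import re

# The number after "of" must repeat group 1 and must not be followed by a digit.
last_page = re.compile(r"Page (\d+) of \1(?=\D|\Z)")

samples = ["Page 25 of 25", "Page 25 of 25la", "Page 2 of 25", "Page 2 of 25la"]
for text in samples:
    print(f"{text!r}: {bool(last_page.search(text))}")
```

The first two samples match (end of input, then trailing OCR letters); the last two do not, because a digit follows the backreference.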

But instead of an extra character like a at the end, you could get an extra digit. Then you could be on the last page of your document, but the regex would not match.

Maybe the best way to attack this problem is to start with the simple regexp

Page\s+(\d+)\s+of\s+(\d+)


and iterate over all occurrences to work around this nasty extra-character problem and get the maximum page number right. Once it is clear how many pages there are, check where group 1 equals group 2.

I included \s+ in my regexp. This is probably also necessary given your data.

But in the end, whether it works depends on the accuracy of the OCR processing.
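The two-pass idea could be sketched in Python roughly as follows. The function name find_last_pages and the sample data are hypothetical. Taking the most frequently reported total as the true page count is an added assumption, not something the answer specifies, and the second pass compares group 1 against that total rather than against a possibly corrupted group 2.

```python
import re
from collections import Counter

PAGE_RE = re.compile(r"Page\s+(\d+)\s+of\s+(\d+)")

def find_last_pages(pages):
    """Return (total, indexes of pages whose current page number equals total)."""
    hits = []
    for index, text in enumerate(pages):
        for m in PAGE_RE.finditer(text):
            hits.append((index, int(m.group(1)), int(m.group(2))))
    if not hits:
        return None, []
    # Assumption: OCR errors are rare, so the most common total is the real one.
    total = Counter(t for _, _, t in hits).most_common(1)[0][0]
    last = [i for i, current, _ in hits if current == total]
    return total, last

# Hypothetical OCR output: the last footer picked up a stray extra digit.
ocr_pages = ["Page 1 of 3", "Page 2  of 3", "Page 3 of 31"]
print(find_last_pages(ocr_pages))
```

With this sample, a strict group 1 equals group 2 check would miss the last page, while the vote still identifies index 2.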
