Regex matching very slow

I am trying to parse a PDF to extract the text from it (please don't suggest any libraries to do this, as this is part of learning the format).
I have already handled deflating it to put it in the alphanumeric format. I now need to extract the text from the text blocks.
So, my current pattern is BT.*?\((.*?)\).*?ET (with DOTMATCHALL set) to match something like:

BT
   /F13 12 Tf
   288 720 Td
   (ABC) Tj
ET

The only bit I want is the text ABC in the brackets.
The above is only formatted like that to make it easy to read. In the deflated text it may all be on one line, or it may not. There is no guarantee that the BT/ET will be at the start of a line. There may or may not be spaces and text before or after the bracketed section. There will, however, be only one bracketed section per BT/ET block.

The above pattern works, but it is really slow. I assume this is because the regex library repeatedly fails and retries while matching the text between BT and the (ABC).
The regex is pre-compiled in an attempt to speed it up, but the gain seems negligible.

How may I speed this up?

How many of these blocks might appear in a document?

Slow regex execution is often the result of catastrophic backtracking, as described here: http://www.regular-expressions.info/catastrophic.html

I don't know what regex technology you're using, but you could try to use lookaround assertions, as described here: http://www.regular-expressions.info/lookaround.html

These allow you to first match just what you want, ABC inside parentheses, and then validate that it is preceded by some value and followed by some other value.
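As a minimal sketch of the lookaround idea in Python: the pattern below matches only the text inside the parentheses, with a lookbehind for the opening parenthesis and a lookahead for the closing one. It assumes the string is always followed by the Tj operator, as in the question's sample; the names TEXT_IN_PARENS and sample are illustrative only.

```python
import re

# Lookbehind checks for "(" just before the text; lookahead checks for ")"
# followed by the Tj operator. Assumption: every string is shown with Tj.
TEXT_IN_PARENS = re.compile(r"(?<=\()[^()]*(?=\)\s*Tj\b)")

sample = "BT /F13 12 Tf 288 720 Td (ABC) Tj ET"

print(TEXT_IN_PARENS.findall(sample))  # ['ABC']
```

Note that Python's re module only accepts fixed-width lookbehinds, so checking for a preceding BT at a variable distance would not work this way.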

Are you sure the regex is correct and pulls out ABC as a match? What language's regex engine is this? My regular expression debugger shows that:

"BT.*?((.*?)).*?ET" doesn't pull out ABC and in fact must find the string 'ET' and then backtrack to find everything else.

"BT.*?\\((.*?)\\).*?ET" works as expected with a single pass left to right.

You can't just parse the PDF with a regex to extract the text. In most cases the text is inside compressed binary blobs or is encoded. A PDF with the text shown like this is very much the exception.

There's not really enough info for a definite answer, or maybe you're assuming we know more about PDF than you do. Are there always parenthesized chunks inside these BT...ET sections? Is there always only one of them? Is the BT or ET always at the beginning of a line? If so, I would suggest:

(?m)^BT[^()]*\((.*?)\)[^()]*?^ET

If I knew how PDF represented literal parentheses, I could probably come up with something more efficient.

EDIT: According to the PDF spec, literal parentheses have to be escaped with a backslash, and there are a bunch of other backslash-escape sequences. So try this:

(?s)\bBT\b[^()]*\(((?:[^()\\]*(?:\\.[^()\\]*)*))\)

The part [^()\\]*(?:\\.[^()\\]*)* matches a block of text which may contain escaped characters (including parens), but not unescaped parens. I know it looks ugly, but it's the most efficient way, since Python doesn't support atomic groups or possessive quantifiers (they were only added to the re module in Python 3.11).

(?s) allows . to match newlines, and \bBT\b makes sure the BT isn't part of a longer "word". I'm reasonably confident that this is all I need to match all of the actual text content, so I don't bother matching the stuff after the closing paren.
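A short usage sketch of that pattern, assuming the deflated stream is already available as a Python str. The sample data and the names TEXT_BLOCK, stream and texts are made up for illustration.

```python
import re

# The pattern from the answer above, compiled once and reused.
TEXT_BLOCK = re.compile(r"(?s)\bBT\b[^()]*\(((?:[^()\\]*(?:\\.[^()\\]*)*))\)")

# Two hypothetical text blocks; the second contains escaped parentheses.
stream = (
    "BT /F13 12 Tf 288 720 Td (ABC) Tj ET\n"
    r"BT /F13 12 Tf (a \(quoted\) word) Tj ET"
)

texts = [match.group(1) for match in TEXT_BLOCK.finditer(stream)]

for text in texts:
    print(text)
```

The captured text still contains the backslash escapes, so they would have to be decoded in a separate step.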

Here's one without regex: simple string parsing using Python's built-in string methods.

>>> xtract="""
... BT
...    /F13 12 Tf
...    288 720 Td
...    (ABC) Tj
... ET
...
... """
>>> for chunk in xtract.split("ET"):
...     if "BT" in chunk:
...         for brace in chunk.split(")"):
...             if "(" in brace:
...                 print(brace[brace.find("(")+1:])
...
ABC
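The same approach can be sketched as a reusable function built on str.find, which avoids creating the intermediate lists that split produces. This is only a sketch: the function name extract_text_blocks is hypothetical, and it assumes one parenthesized section per block and that neither "BT", "ET" nor ")" occurs inside the text itself.

```python
def extract_text_blocks(data):
    """Return the parenthesized text of each BT ... ET block in data."""
    results = []
    pos = 0
    while True:
        start = data.find("BT", pos)
        if start == -1:
            break
        end = data.find("ET", start + 2)
        if end == -1:
            break
        # Look for the parentheses only between this BT and its ET.
        open_paren = data.find("(", start + 2, end)
        if open_paren != -1:
            close_paren = data.find(")", open_paren + 1, end)
            if close_paren != -1:
                results.append(data[open_paren + 1:close_paren])
        pos = end + 2  # continue scanning after this block
    return results


print(extract_text_blocks("BT /F13 12 Tf 288 720 Td (ABC) Tj ET"))  # ['ABC']
```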

Since there is only one parenthesized expression between BT and ET, you can try the following regex to improve speed:

r"(?s)\bBT\b[^(]*\(([^)]*)\).*?\bET\b"
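As a quick check, the sketch below runs the question's lazy pattern and this negated-character-class pattern over the same made-up stream and confirms that they extract the same text. The negated classes [^(]* and [^)]* cannot run past the first parenthesis, which gives the engine less room to backtrack. The sample data and variable names are illustrative only.

```python
import re

# Pattern from the question (lazy dots) and the pattern from this answer.
LAZY = re.compile(r"BT.*?\((.*?)\).*?ET", re.DOTALL)
NEGATED = re.compile(r"(?s)\bBT\b[^(]*\(([^)]*)\).*?\bET\b")

# Hypothetical stream: one block on a single line, one spread over lines.
stream = (
    "q BT /F13 12 Tf 288 720 Td (ABC) Tj ET Q\n"
    "BT\n/F13 12 Tf\n72 700 Td\n(second line) Tj\nET\n"
)

lazy_texts = LAZY.findall(stream)
negated_texts = NEGATED.findall(stream)

print(negated_texts)                 # ['ABC', 'second line']
print(lazy_texts == negated_texts)   # True
```

Unlike the pattern that handles backslash escapes, [^)]* stops at the first closing parenthesis, so text containing an escaped \) would be cut short.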
