
Regex where parentheses might not be balanced

I have to pull some text out of a PDF stream as a string. The stream contains both the markup that describes the appearance of the text and the text itself. The string my regex runs on will never contain carriage returns or line feeds. The text I am interested in is always inside parentheses (which may themselves contain parentheses), and the final closing parenthesis is always followed by the letters 'Tj'. In short, what I am after always follows this convention:

(.....) Tj

At the moment, the regex I have works as long as the parentheses are all balanced:

\((?:[^()]|(?'paren'\()|(?'-paren'\)))+(?(paren)(?!))\)

However, if the text itself contains unbalanced parentheses, this regex will not extract what I want, and I am not sure how to change it to handle them.
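To illustrate why a balance-based pattern cannot return the whole string here, the following is a minimal sketch in Python (an assumption; the question targets .NET). It tracks nesting depth roughly the way the `paren` group stack does. The helper name `is_balanced` is hypothetical and not part of the original post.

```python
def is_balanced(text: str) -> bool:
    """Return True if every '(' in text is closed by a later ')'."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared with no '(' open before it
                return False
    return depth == 0


# The 'normal' sample is balanced, so a balanced pattern can match all of it.
print(is_balanced("(RE:  Request for Additional Information)"))
# With one stray '(' added, the depth never returns to zero.
print(is_balanced("(RE:  Request for (Additional Information)"))
```

Because the second string never returns to depth zero, a pattern that insists on balance can at best match a smaller, balanced piece of it rather than the full user text.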

Here is a sample of what would be considered a 'normal' string:

q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for Additional Information) Tj

So obviously, I want to get the string 'RE:  Request for Additional Information' out of that.

And here is an example on which my regex fails (I have added unbalanced parentheses):

q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for (Additional Information) Tj 0 g  1 0 0 1 2 383.29 Tm  0 g  (     13. Processing TT Instructions -) Audit Note 12) Tj  0 g  1 0 0 1 2 369.42 Tm  0 g  () Tj  0 g  1 0 0 1 2 355.55 Tm  0 g  (Dear test:) Tj  0 g  1 0 0 1 2 341.68 Tm  0 g  () Tj  0 g  1 0 0 1 2 327.8 Tm  0 g  (Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here) Tj  

There are also empty sets of parentheses in here, which look like:

() Tj

These represent carriage returns and line feeds when the PDF is rendered. Any help is appreciated. Thank you in advance.

--- UPDATE to answer questions below ---

Any type of user input can be placed between the opening and closing parentheses. I want to extract all of the content exactly as provided, even if the user forgot to balance their parentheses. The only guarantee is that the text between the parentheses is user input; how they enter it is up to them, so it does NOT follow a predefined format such as ([abbrev]: [content]), etc. The content is only guaranteed to sit between an opening parenthesis and a closing parenthesis, and the closing parenthesis will be followed by the letters 'Tj'.

As I mentioned in a comment, I can't help with .NET, but I can give you an expression that might help. I think the solution requires negative lookahead, which Perl offers. The problem is that I haven't used Perl in so long that I've forgotten how to make it march through the entire stream. If I break the stream into chunks of "(...) Tj", each on its own line, my script works on all your examples:

$ cat pdf_data_line_by_line.txt
q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for Additional Information) Tj
q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for (Additional Information) Tj
0 g  1 0 0 1 2 383.29 Tm  0 g  (     13. Processing TT Instructions -) Audit Note 12) Tj
0 g  1 0 0 1 2 369.42 Tm  0 g  () Tj
0 g  1 0 0 1 2 355.55 Tm  0 g  (Dear test:) Tj
0 g  1 0 0 1 2 341.68 Tm  0 g  () Tj
0 g  1 0 0 1 2 327.8 Tm  0 g  (Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here) Tj
$ cat get_pdf_text.pl
#!/usr/bin/perl
while (<>) {
   # find some text
   if ( /[^(]*\((?!\)).*\) Tj/ ) {
      # strip off leading junk
      s/[^(]*\((?!\))[ ]*([^)].*)\) Tj/$1/;
      # output saved part of match
      print $_;
      print "YOUR DELIMITER HERE\n";
   }
}
$ cat pdf_data_line_by_line.txt | ./get_pdf_text.pl
RE:  Request for Additional Information
YOUR DELIMITER HERE
RE:  Request for (Additional Information
YOUR DELIMITER HERE
13. Processing TT Instructions -) Audit Note 12
YOUR DELIMITER HERE
Dear test:
YOUR DELIMITER HERE
Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here
YOUR DELIMITER HERE

However, if I combine the examples into a single stream, it stops after the first one. I tried adding "g" at the end of the 's' command, but it didn't help:

$ cat pdf_data_single_stream.txt
q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for (Additional Information) Tj 0 g  1 0 0 1 2 383.29 Tm  0 g  (     13. Processing TT Instructions -) Audit Note 12) Tj 0 g  1 0 0 1 2 369.42 Tm  0 g  () Tj  0 g  1 0 0 1 2 355.55 Tm  0 g  (Dear test:) Tj 0 g  1 0 0 1 2 341.68 Tm  0 g  () Tj  0 g  1 0 0 1 2 327.8 Tm  0 g  (Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here) Tj
$ cat pdf_data_single_stream.txt | ./get_pdf_text.pl
RE:  Request for (Additional Information) Tj 0 g  1 0 0 1 2 383.29 Tm  0 g  (     13. Processing TT Instructions -) Audit Note 12) Tj 0 g  1 0 0 1 2 369.42 Tm  0 g  () Tj  0 g  1 0 0 1 2 355.55 Tm  0 g  (Dear test:) Tj 0 g  1 0 0 1 2 341.68 Tm  0 g  () Tj  0 g  1 0 0 1 2 327.8 Tm  0 g  (Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here
YOUR DELIMITER HERE

The substitution command ...

s/[^(]*\((?!\))[ ]*([^)].*)\) Tj/$1/

... does the following: find zero or more characters that are not '(', followed by a single '(' that is not followed by a ')' (this is where you need negative lookahead, and it eliminates the '() Tj' cases), followed by zero or more spaces; then capture the next character, provided it is not a ')', together with zero or more following characters, as long as they are followed by ') Tj'; and finally replace the whole match with the captured string. If anyone can suggest the (probably very simple) way to get the script to march all the way through the stream, that should solve the problem at hand.
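One possible way to march through the whole stream is to make the quantifier lazy, so that each match ends at the first ')' followed by 'Tj', and to let the regex engine iterate over all matches. The sketch below uses Python rather than Perl or .NET (an assumption made for illustration). The names `TJ_STRING` and `extract_tj_strings` are hypothetical, and the approach assumes the user text itself never contains ') Tj'.

```python
import re
from typing import List

# '\(' opens the string; the lazy '(.*?)' grows only until it reaches a ')'
# that is followed by optional whitespace and the operator 'Tj'.
TJ_STRING = re.compile(r"\((.*?)\)\s*Tj\b")


def extract_tj_strings(stream: str) -> List[str]:
    """Return the text of every '(...) Tj' chunk, in order of appearance."""
    return TJ_STRING.findall(stream)


stream = (
    "/Helv 12 Tf  0 g  (RE:  Request for (Additional Information) Tj 0 g  "
    "1 0 0 1 2 383.29 Tm  0 g  (     13. Processing TT Instructions -) "
    "Audit Note 12) Tj  0 g  () Tj  0 g  (Dear test:) Tj"
)

for text in extract_tj_strings(stream):
    # Empty strings correspond to the '() Tj' chunks, i.e. line breaks.
    print(repr(text))
```

Unlike the script above, this sketch keeps the empty '() Tj' chunks as empty strings, so the caller can decide whether to turn them into line breaks or drop them. The same lazy-quantifier idea should carry over to Perl (`.*?` with the `/g` modifier in a loop) and to .NET (`Regex.Matches`), although neither variant has been tested here.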
