
Regex for capturing smallest group

I am trying to capture the ID of a PDF Page object that looks like this:

4 0 obj
<<
/Type /Page /
...
>>
endobj

The ID is the number in 'ID 0 obj'. The problem is that my file has multiple objects, so the following pattern captures from the first object declaration to the first instance of a Page object:

preg_match_all("/([0-9]+) 0 obj.+?\/Page[ \n]*?\//s", $input_lines, $output_array);

Here is a sample of my file if you want to try it out. You will see that there are multiple objects that include the word 'Page':

%PDF-1.3
%¦¦¦¦

1 0 obj
<<
/Type /Catalog /AcroForm << /Fields [12 0 R 13 0 R] /NeedAppearances false  /SigFlags 3 /Version /1.7 /Pages 3 0 R /Names << >> /ViewerPreferences << /Direction /L2R >> /PageLayout /SinglePage /PageMode /UseNone /OpenAction [0 0 R /FitH null] /DR << /Font << /F1 14 0 R >> >> /DA (/F1 0 Tf 0 g) /Q 0 >> /Perms << /DocMDP 11 0 R >>
/Outlines 2 0 R
/Pages 3 0 R
>>
endobj

2 0 obj
<<
/Type /Outlines
/Count 0
>>
endobj

3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj

4 0 obj
<<
/Type /Page
/Parent 3 0 R
/Resources <<
/Font <<
/F1 9 0 R
>>
/ProcSet 8 0 R
>>
/MediaBox [0 0 612.0000 792.0000]
/Contents 5 0 R
>>
endobj

5 0 obj
<< /Length 1074 >>
stream
2 J
BT
0 0 0 rg
/F1 0027 Tf
57.3750 722.2800 Td
( A Simple PDF File ) Tj
ET
BT
/F1 0010 Tf

What should I change to make it non-greedy?

EDIT: Clarifications

  • I forgot to mention that I need to capture all of the Page object IDs.
  • Some people suggested using a more specific regex, but the example above is not a formal specification of how objects are built; the variant below is also possible. Note that the spaces are not mandatory and that multiple tags can precede the '/Type /Page' tag.

Example:

4 0 obj
<< /UselessTag/Type/Page/
...
>>
endobj
  • There are tags called Pages, PageLayout and SinglePage, and I don't want to capture them.

You may use

'~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm'

See the regex demo.

Details:

  • ^ - start of a line anchor (the m modifier makes ^ match at the start of each line rather than only at the start of the whole string)
  • (\d+) 0 obj - 1 or more digits (captured into Group 1), then a space, 0, a space and the obj substring
  • (?:(?!^\d+ 0 obj$).)*? - a tempered greedy token that matches any char ( . ) that does not start a ^\d+ 0 obj$ pattern, as few times as possible
  • \/Type\s*\/Page\s - /Type, 0+ whitespaces (replace \s with \h to only match horizontal whitespace), /Page and then a whitespace
  • .*? - any 0+ chars, as few as possible, up to the first occurrence of
  • endobj - endobj followed by...
  • $ - the end of line position.
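To see the pattern described above in action, here is a minimal PHP sketch. The sample input and the variable names ($input, $pageIds) are made up for illustration; only the pattern itself comes from the answer.

```php
<?php
// Hypothetical sample: three objects, of which only object 4 is a /Page.
$input = <<<'PDF'
1 0 obj
<<
/Type /Catalog /PageLayout /SinglePage
/Pages 3 0 R
>>
endobj

3 0 obj
<<
/Type /Pages
/Count 1
/Kids [ 4 0 R ]
>>
endobj

4 0 obj
<<
/Type /Page
/Parent 3 0 R
>>
endobj
PDF;

// s: . also matches newlines; m: ^ and $ match at every line boundary.
$pattern = '~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm';

preg_match_all($pattern, $input, $matches);
$pageIds = $matches[1]; // contents of capture group 1 for every match
print_r($pageIds);
```

Because the tempered token refuses to step over the next "N 0 obj" line, a match that starts at object 1 or 3 cannot reach into object 4; the engine gives up and retries from the next object header. Note that the trailing \s requires whitespace after /Page, so a dictionary written as /Type/Page/ or /Type/Page>> would not match this pattern without adjusting it.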

You can append a question mark to a specific quantifier to make it non-greedy (lazy):

Example:

 \(.*\)

Matches the whole span '(test)test(test)test(test)' in:

test (test)test(test)test(test) test

Example:

 \(.*?\)

Matches only up to the first closing parenthesis, '(test)', in:

test (test)test(test)test(test) test
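The two examples above can be reproduced with a short PHP sketch (the variable names are illustrative):

```php
<?php
$text = 'test (test)test(test)test(test) test';

// Greedy: .* runs as far as possible, so the match ends at the last ')'.
preg_match('/\(.*\)/', $text, $greedy);

// Lazy: .*? stops at the first ')' it can reach; preg_match_all then
// continues after each match and finds the remaining groups.
preg_match_all('/\(.*?\)/', $text, $lazy);

echo $greedy[0], "\n";               // (test)test(test)test(test)
echo implode(' | ', $lazy[0]), "\n"; // (test) | (test) | (test)
```

A lazy quantifier only controls where a match ends, not where it starts. That is why the .+? in the question's pattern still begins at the first object header: the engine keeps the earliest possible starting point and extends from there until /Page is found.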

Try a more specific regex so that it does not match unneeded parts of the text.

preg_match_all("/([0-9]+?) 0 obj\n\<\<\n\/Type\s\/Page[ \n]*?\//s", $input_lines, $output_array);

Proof: https://regex101.com/r/HjyQpS/1

This should work:

(\d+) 0 obj[^>]+/Page$

Regex101 demo
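As a rough check of the pattern above in PHP, the sketch below wraps it in ~ delimiters and adds the m modifier; both are assumptions, since the answer gives the bare pattern only. The sample input is made up.

```php
<?php
$input = "3 0 obj\n<<\n/Type /Pages\n/Count 2\n>>\nendobj\n\n"
       . "4 0 obj\n<<\n/Type /Page\n/Parent 3 0 R\n>>\nendobj\n";

// ~ as delimiter avoids escaping '/'; m lets $ match at the end of each line.
preg_match_all('~(\d+) 0 obj[^>]+/Page$~m', $input, $m);
print_r($m[1]); // IDs of objects where /Page ends a line before any '>'
```

This works on the layout shown in the question, but it depends on /Page being the last thing on its line and on no '>' appearing between obj and /Page. A dictionary written on one line as /Type/Page/, or one with a nested << ... >> before /Type, would be missed.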

I wouldn't use regular expressions on a PDF. There are several conditions under which this approach will fail:

  1. The page object is inside an object stream (and therefore packed, most probably with the Deflate algorithm). This is allowed in PDF version 1.5 and up.
  2. Incremental updates inside the PDF document can lead to duplicate hits for the same page.
  3. The marker /Page is not inside the dictionary you want to match, but inside an indirect object (never seen, but theoretically possible). E.g. you have:
 5 0 obj << /Type 6 0 R ....>> endobj 6 0 obj /Page endobj 

Note: You also cannot expect the pages to be written in the PDF document in the same order as you see them in the viewer.

But if you really must do it that way, I would first match each PDF object with

/([0-9]+) 0 obj(.+?)endobj/s

(the s modifier is needed so that . also matches newlines) and would then search the second captured group for

/\/Type\s*\/Page[\s>]/

Allowing > at the end is important, because you also need to be able to match "/Type/Page>>", where /Type/Page is the last entry in the PDF dictionary.
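The two-step approach can be sketched in PHP as follows. The function name findPageObjectIds and the sample input are hypothetical. The sketch uses the s modifier on the first pattern so that . can cross line breaks, and it additionally accepts a '/' after /Page (an assumption not in the answer) to cover the /Type/Page/ variant from the question.

```php
<?php
// Hypothetical helper, not part of any library.
function findPageObjectIds(string $pdfText): array
{
    $ids = [];
    // Step 1: cut the text into "N 0 obj ... endobj" chunks.
    preg_match_all('/([0-9]+) 0 obj(.+?)endobj/s', $pdfText, $objects, PREG_SET_ORDER);

    foreach ($objects as [, $id, $body]) {
        // Step 2: accept /Type /Page only when followed by whitespace,
        // '>' or '/' (the '/' is an added assumption), which rules out
        // /Pages, /PageLayout and similar names.
        if (preg_match('/\/Type\s*\/Page[\s>\/]/', $body)) {
            $ids[] = (int) $id;
        }
    }
    return $ids;
}

$sample = "3 0 obj\n<< /Type /Pages /Kids [ 4 0 R 6 0 R ] >>\nendobj\n"
        . "4 0 obj\n<< /UselessTag/Type/Page/Parent 3 0 R >>\nendobj\n"
        . "6 0 obj\n<</Parent 3 0 R/Type/Page>>\nendobj\n";

print_r(findPageObjectIds($sample));
```

Splitting first and testing each body separately keeps a match from spilling over from one object into the next, which is the problem in the original single-pattern attempt. The failure cases listed above (object streams, incremental updates, indirect /Type values) still apply.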

Use this regular expression:

/\d+\s0\sobj.+endobj/smU

Note that the modifier U makes the match non-greedy. See the matching example here: https://www.tinywebhut.com/regex/8
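A small PHP sketch of what the U modifier changes, using a made-up two-object input:

```php
<?php
$text = "1 0 obj\n<< /Type /Catalog >>\nendobj\n"
      . "2 0 obj\n<< /Type /Outlines >>\nendobj\n";

// Without U, .+ is greedy and one match swallows both objects.
$greedyCount = preg_match_all('/\d+\s0\sobj.+endobj/sm', $text, $greedy);

// With U, every quantifier is lazy by default, so each object is
// matched on its own.
$lazyCount = preg_match_all('/\d+\s0\sobj.+endobj/smU', $text, $lazy);

echo $greedyCount, ' vs ', $lazyCount, "\n"; // 1 vs 2
```

U inverts the greediness of all quantifiers in the pattern, and a ? after a quantifier then makes that one greedy again. Also note that this pattern matches every object, not only Page objects, so the matches would still need to be filtered for /Type /Page afterwards.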
