Regex for capturing smallest group
I am trying to capture the ID of a PDF Page object that looks like this:
4 0 obj
<<
/Type /Page /
...
>>
endobj
The ID is the number in 'ID 0 obj'. The problem is that my file has multiple objects, so the following pattern captures from the first object declaration to the first instance of a Page object:
preg_match_all("/([0-9]+) 0 obj.+?\/Page[ \n]*?\//s", $input_lines, $output_array);
Here is a sample of my file if you want to try it out; you will see that there are multiple objects that include the word 'Page':
%PDF-1.3
%¦¦¦¦
1 0 obj
<<
/Type /Catalog /AcroForm << /Fields [12 0 R 13 0 R] /NeedAppearances false /SigFlags 3 /Version /1.7 /Pages 3 0 R /Names << >> /ViewerPreferences << /Direction /L2R >> /PageLayout /SinglePage /PageMode /UseNone /OpenAction [0 0 R /FitH null] /DR << /Font << /F1 14 0 R >> >> /DA (/F1 0 Tf 0 g) /Q 0 >> /Perms << /DocMDP 11 0 R >>
/Outlines 2 0 R
/Pages 3 0 R
>>
endobj
2 0 obj
<<
/Type /Outlines
/Count 0
>>
endobj
3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj
4 0 obj
<<
/Type /Page
/Parent 3 0 R
/Resources <<
/Font <<
/F1 9 0 R
>>
/ProcSet 8 0 R
>>
/MediaBox [0 0 612.0000 792.0000]
/Contents 5 0 R
>>
endobj
5 0 obj
<< /Length 1074 >>
stream
2 J
BT
0 0 0 rg
/F1 0027 Tf
57.3750 722.2800 Td
( A Simple PDF File ) Tj
ET
BT
/F1 0010 Tf
What should I change to make it not greedy?
EDIT: Clarifications
Example:
4 0 obj
<< /UselessTag/Type/Page/
...
>>
endobj
You may use
'~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm'
See the regex demo.
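As a rough sketch of how the pattern above could be applied with preg_match_all: the sample input below is a shortened, hypothetical stand-in for the file in the question, and it assumes LF line endings (with CRLF the $ anchor would not match directly after obj).

```php
<?php
// Hypothetical, shortened sample input; only the object layout matters here.
$input_lines = <<<PDF
3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj
4 0 obj
<<
/Type /Page
/Parent 3 0 R
>>
endobj
5 0 obj
<< /Length 1074 >>
endobj
PDF;

// Pattern from the answer: the tempered token refuses to cross
// into the next "N 0 obj" header line.
$pattern = '~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm';

preg_match_all($pattern, $input_lines, $output_array);

// Group 1 holds the object numbers of the matched Page objects.
print_r($output_array[1]);
```

With this input only object 4 should be reported: object 3 has /Type /Pages, which fails the \s required after /Page, and the lookahead stops the scan from running on into object 4.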
Details:
^ - start-of-line anchor (the m modifier makes ^ match at the start of a line and not only at the start of the whole string)
(\d+) 0 obj - 1 or more digits (captured into Group 1), then a space, 0, a space and the obj substring
(?:(?!^\d+ 0 obj$).)*? - a tempered greedy token that matches any char (.) that does not start a ^\d+ 0 obj$ pattern, as few times as possible
\/Type\s*\/Page\s - /Type, 0+ whitespaces (replace \s with \h to only match horizontal whitespace), /Page and then a whitespace
.*? - any 0+ chars, as few as possible, up to the first occurrence of
endobj - endobj followed by...
$ - the end-of-line position.
You can add a question mark to a specific quantifier to make it ungreedy (lazy):
Example:
\(.*\)
Matches (a single match, from the first "(" to the last ")"):
test (test)test(test)test(test) test
Example:
\(.*?\)
Matches (each "(test)" separately):
test (test)test(test)test(test) test
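The difference between the two quantifiers can be checked directly in PHP. This is only a small sketch; the variable names are made up for illustration.

```php
<?php
$text = 'test (test)test(test)test(test) test';

// Greedy: .* runs as far as it can, so one match spans
// from the first "(" to the last ")".
preg_match_all('/\(.*\)/', $text, $greedy);

// Lazy: .*? stops at the nearest ")", giving one match
// per parenthesised group.
preg_match_all('/\(.*?\)/', $text, $lazy);

print_r($greedy[0]);
print_r($lazy[0]);
```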
Try a more specific regex so that it does not match unneeded parts of the text.
preg_match_all("/([0-9]+?) 0 obj\n\<\<\n\/Type\s\/Page[ \n]*?\//s", $input_lines, $output_array);
Proof: https://regex101.com/r/HjyQpS/1
I wouldn't use regular expressions on PDF. There are several situations where this approach will fail, for example when a dictionary value is an indirect reference to another object:
5 0 obj << /Type 6 0 R ....>> endobj 6 0 obj /Page endobj
Note: You also cannot expect the pages to be written inside the PDF document in the same order as you see them in the viewer.
But if you really must do it that way, I would first match the PDF object with
/([0-9]+) 0 obj(.+?)endobj/s
(the s modifier is needed so that . also matches newlines) and would then search in the second matched group for
/\/Type\s*\/Page[\s>]/
Allowing > at the end is important, because you also need to be able to match "/Type/Page>>", where /Type/Page is the last entry in the PDF dictionary.
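The two-step idea could look roughly like this in PHP. The function name findPageObjectIds and the sample string are hypothetical, and the s modifier on the first pattern is an assumption added here so that . can cross newlines. It still inherits the limitations mentioned above (indirect /Type values, stream data that happens to contain endobj).

```php
<?php
// Hypothetical helper: returns the object numbers of Page objects.
function findPageObjectIds(string $pdf): array
{
    $ids = [];
    // Step 1: split the file into objects; group 1 is the ID,
    // group 2 is everything up to the nearest endobj.
    preg_match_all('/([0-9]+) 0 obj(.+?)endobj/s', $pdf, $objects, PREG_SET_ORDER);
    foreach ($objects as $object) {
        // Step 2: look for /Type /Page followed by whitespace or ">".
        if (preg_match('/\/Type\s*\/Page[\s>]/', $object[2])) {
            $ids[] = (int) $object[1];
        }
    }
    return $ids;
}

// Made-up sample: a Pages object, a Page object whose /Type/Page
// is the last dictionary entry, and an unrelated object.
$sample = "3 0 obj\n<<\n/Type /Pages\n/Count 1\n>>\nendobj\n"
        . "4 0 obj\n<< /UselessTag/Type/Page>>\nendobj\n"
        . "5 0 obj\n<< /Length 0 >>\nendobj\n";

print_r(findPageObjectIds($sample));
```

Because each object body is tested on its own, a /Page that appears in a later object can no longer be attributed to an earlier ID.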
Use this regular expression:
/\d+\s0\sobj.+endobj/smU
Note that the modifier U makes the match non-greedy. See the matching example here: https://www.tinywebhut.com/regex/8
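To see what the U modifier changes, the same pattern can be run with and without it. The input below is a made-up two-object sample; note that this pattern matches every object, not only Page objects, so the results would still need filtering.

```php
<?php
// Hypothetical two-object sample.
$input_lines = "4 0 obj\n<<\n/Type /Page\n>>\nendobj\n"
             . "5 0 obj\n<< /Length 0 >>\nendobj\n";

// With U, .+ behaves lazily, so each match ends at the nearest endobj.
preg_match_all('/\d+\s0\sobj.+endobj/smU', $input_lines, $with_u);

// Without U, .+ is greedy and a single match runs to the last endobj.
preg_match_all('/\d+\s0\sobj.+endobj/sm', $input_lines, $without_u);

echo count($with_u[0]), "\n";
echo count($without_u[0]), "\n";
```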