PHP regex to get base64 string
I have a file, smime.p7m, with a lot of content. One or more parts of that content look like this:
--_3821f5f5-222-4a90-82e0-d8922ee62cc8_
Content-Type: application/pdf;
name="001235_0001.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="001235_0001.pdf"
JVBERi0xLjMNCjMgMCBvYmoNCjw8DQogIC9UeXBlIC9YT2JqZWN0DQogIC9TdWJ0eXBlIC9J
bWFnZQ0KICAvRmlsdGVyIC9EQ1REZWNvZGUNCiAgL1dpZHRoIDI0MDkNCiAgL0hlaWdodCAz
AF6UAFACZoAUUAFABQA1TQAuaADGKAFoASgBaACgBKADpTAQnApAJ0oAdQAdKAD2oAXpQA3p
.........................................
0oAU9KAFHFABQAnSgBOaAFoAKACgAoAWgAoATGOlAAKAFoATpQAYoAO9AC0AFACZ7UAGKAFo
ZPi1JZBodj7GEjdqgELTq0RC7xeSu1yv+dwEltQFPoSMGcbiTf0cGyzbreEAAAAAAAA=
--------------ms021111111111111111111107--
Is there a way, for example with a regex, to get the filename (if it's a PDF) and the base64 code below it? The file may contain more than one PDF.
The filename is not the problem. I get it with filename="(.*).pdf". But I don't know how to get the base64 code that follows the filename.
Base64 consists of the characters A...Z, a...z, the digits 0..9, and the symbols + and /. It can also have one or two = at the end and can be split across several lines.
// Capture the quoted PDF filename, then the run of base64 lines that follows it.
if (preg_match('/filename=\"(?P<filename>[^"]*?\.pdf)\"\s*(?P<base64>([A-Za-z0-9+\/]+\s*)+=?=?)/', $s, $regres)) {
    print("FileName: {$regres['filename']}\n");
    print("Base64: {$regres['base64']}\n");
}
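Since the question mentions that there can be more than one PDF, the same idea can be sketched with preg_match_all(). This is only a sketch under assumptions: the input is a tiny made-up sample rather than the real smime.p7m, the variable $found is introduced here for illustration, and the quantifiers are made possessive (a change from the pattern above) so the engine does not backtrack into the repeated group.

```php
<?php
// Hypothetical input with two PDF parts (tiny made-up payloads, not the question's data).
$s = <<<'EOT'
--boundary
Content-Disposition: attachment;
filename="a.pdf"

JVBERi0x
LjMNCg==
--boundary
Content-Disposition: attachment;
filename="b.pdf"

SGVsbG8=
--boundary--
EOT;

// Possessive quantifiers (++ and *+) keep the engine from backtracking into the repeated group.
$pattern = '/filename="(?P<filename>[^"]*\.pdf)"\s*(?P<base64>(?:[A-Za-z0-9+\/]++\s*+)++={0,2})/';

$found = [];
if (preg_match_all($pattern, $s, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // Strip the line breaks so each value is one continuous base64 string.
        $found[$m['filename']] = preg_replace('/\s+/', '', $m['base64']);
    }
}

foreach ($found as $name => $b64) {
    print("FileName: {$name}\nBase64: {$b64}\n");
}
```

With PREG_SET_ORDER each element of $matches is one attachment, so the named groups can be read per match.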
Use
(?im)^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)
PHP:
preg_match_all('/^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)/im', $str, $matches);
Explanation
--------------------------------------------------------------------------------
  (?im)                    set flags for this block (case-insensitive)
                           (with ^ and $ matching start and end of line)
                           (with . not matching \n)
                           (matching whitespace and # normally)
--------------------------------------------------------------------------------
  ^                        the beginning of a "line"
--------------------------------------------------------------------------------
  filename="               'filename="'
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    pdf                      'pdf'
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  "                        '"'
--------------------------------------------------------------------------------
  \R+                      any line break sequence (1 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    .+                       any character except \n (1 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (1 or more
                             times (matching the most amount
                             possible)):
--------------------------------------------------------------------------------
      \R                       any line break sequence
--------------------------------------------------------------------------------
      .+                       any character except \n (1 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )+                       end of grouping
--------------------------------------------------------------------------------
  )                        end of \2
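To see what the two groups hold in practice, here is a small sketch on made-up input (the payload and the boundary text are assumptions, not the question's data). Because group 2 takes every consecutive non-empty line, a boundary line that directly follows the base64 block is captured as well, so this sketch filters the captured lines afterwards; that filtering step is an addition for illustration.

```php
<?php
// Hypothetical input (made-up payload); the boundary line directly follows the base64 lines.
$str = "Content-Disposition: attachment;\nfilename=\"a.pdf\"\n\nSGVs\nbG8=\n--boundary--\n";

preg_match_all('/^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)/im', $str, $matches, PREG_SET_ORDER);

$result = [];
foreach ($matches as $m) {
    // Group 2 holds every consecutive non-empty line, so the boundary line is in there too.
    $lines = preg_split('/\R/', $m[2]);
    // Keep only the lines made up entirely of base64 characters.
    $base64Lines = preg_grep('~^[A-Za-z0-9+/]+={0,2}$~', $lines);
    $result[$m[1]] = implode('', $base64Lines);
}
```

Here $result maps each filename from group 1 to its base64 text with the line breaks and the boundary line removed.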
I gather that this task is not about validation at all and focuses solely on data extraction, so tightening the regex logic is unnecessary.
You only need a pattern that matches filename=" at the start of a line, then captures the quote-wrapped substring (so long as it ends in .pdf), then, after any number of whitespace characters, captures all characters until one or two = are encountered. This relies on the encoded block ending in = padding, as it does in the sample; an unpadded block would not be delimited correctly.
Using greedy negated character classes allows the regex engine to move quickly. The m pattern modifier tells the regex engine that the ^ metacharacter (not the ^ used inside square brackets) may match the start of a line in addition to the start of the string.
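The two roles of ^ can be checked with a short sketch; the input string and the variable names are made up for illustration.

```php
<?php
// Hypothetical two-line input; filename= sits at the start of the second line.
$text = "Content-Disposition: attachment;\nfilename=\"x.pdf\"";

// Without m, ^ only matches at the very start of the string.
$withoutM = preg_match('~^filename=~', $text);
// With m, ^ also matches right after each newline.
$withM = preg_match('~^filename=~m', $text);
// Inside square brackets, ^ negates the class instead of anchoring.
preg_match('~"([^"]+)"~', $text, $quoted);
```

Only the call with the m modifier finds filename= here, and the negated class [^"]+ collects everything up to the closing quote.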
Perhaps you'd like to generate an associative array where the keys are the filename strings and the encoded strings are the values; array_column() does a snappy job of setting that up when there are qualifying matches.
var_export(
    preg_match_all(
        '~^filename="([^"]+)\.pdf"\s*([^=]+={1,2})~m',
        $fileContents,
        $out,
        PREG_SET_ORDER
    )
    ? array_column($out, 2, 1)
    : "no pdf's found"
);
Output:
array (
'001235_0001' => 'JVBERi0xLjMNCjMgMCBvYmoNCjw8DQogIC9UeXBlIC9YT2JqZWN0DQogIC9TdWJ0eXBlIC9J
bWFnZQ0KICAvRmlsdGVyIC9EQ1REZWNvZGUNCiAgL1dpZHRoIDI0MDkNCiAgL0hlaWdodCAz
AF6UAFACZoAUUAFABQA1TQAuaADGKAFoASgBaACgBKADpTAQnApAJ0oAdQAdKAD2oAXpQA3p
.........................................
0oAU9KAFHFABQAnSgBOaAFoAKACgAoAWgAoATGOlAAKAFoATpQAYoAO9AC0AFACZ7UAGKAFo
ZPi1JZBodj7GEjdqgELTq0RC7xeSu1yv+dwEltQFPoSMGcbiTf0cGyzbreEAAAAAAAA=',
)
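If the goal is to get the actual PDF bytes, the values of such an array still contain line breaks and need decoding. A minimal sketch, assuming an array in the same shape as the output above; the names $encodedByName and $decoded and the tiny payloads are hypothetical, and the check for the %PDF- signature is an extra sanity test rather than part of the answer.

```php
<?php
// Hypothetical result in the same shape as the array above (tiny payloads, assumed data).
$encodedByName = [
    'first'  => "JVBERi0x\nLjMNCg==",
    'broken' => "not base64!",
];

$decoded = [];
foreach ($encodedByName as $name => $encoded) {
    // Drop the line breaks left over from the MIME body.
    $clean = preg_replace('~\s+~', '', $encoded);
    // Strict mode makes base64_decode() return false on characters outside the alphabet.
    $binary = base64_decode($clean, true);
    // Keep only payloads that decode and start with the PDF signature.
    if ($binary !== false && strncmp($binary, '%PDF-', 5) === 0) {
        $decoded[$name . '.pdf'] = $binary;
    }
}
```

Each surviving entry could then be written out with file_put_contents() under its key.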