
PHP regex to get a base64 string

I have a file, smime.p7m, with a lot of content. One or more parts of that content look like this:

--_3821f5f5-222-4a90-82e0-d8922ee62cc8_
Content-Type: application/pdf;
name="001235_0001.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="001235_0001.pdf"

JVBERi0xLjMNCjMgMCBvYmoNCjw8DQogIC9UeXBlIC9YT2JqZWN0DQogIC9TdWJ0eXBlIC9J
bWFnZQ0KICAvRmlsdGVyIC9EQ1REZWNvZGUNCiAgL1dpZHRoIDI0MDkNCiAgL0hlaWdodCAz
AF6UAFACZoAUUAFABQA1TQAuaADGKAFoASgBaACgBKADpTAQnApAJ0oAdQAdKAD2oAXpQA3p
.........................................
0oAU9KAFHFABQAnSgBOaAFoAKACgAoAWgAoATGOlAAKAFoATpQAYoAO9AC0AFACZ7UAGKAFo
ZPi1JZBodj7GEjdqgELTq0RC7xeSu1yv+dwEltQFPoSMGcbiTf0cGyzbreEAAAAAAAA=
--------------ms021111111111111111111107--

Is there a way, for example with a regex, to get the filename if the part is a PDF, together with the base64 code below it? The file may contain more than one PDF.

The filename is not the problem. I get that with filename="(.*).pdf". But I don't know how to get the base64 code that follows the filename.

Base64 consists of the characters A-Z and a-z, the digits 0-9, and the symbols + and /. It can also end with one or two = characters and can be split across several lines.

if (preg_match('/filename=\"(?P<filename>[^"]*?\.pdf)\"\s*(?P<base64>([A-Za-z0-9+\/]+\s*)+=?=?)/', $s, $regres)) {
   print("FileName: {$regres['filename']}\n");
   print("Base64: {$regres['base64']}\n");
}
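
Because the payload is wrapped across lines, the captured text usually needs its line breaks removed before it can be decoded. A minimal sketch, assuming a short made-up payload (the $wrapped value is a stand-in, not taken from the question):

```php
<?php
// Made-up sample: a short base64 payload wrapped across lines, as in a MIME part.
$wrapped = "SGVsbG8s\r\nIFBERiE=\r\n";

// Remove all whitespace (line breaks included) before decoding.
$compact = preg_replace('/\s+/', '', $wrapped);

// Strict mode returns false if characters outside the base64 alphabet remain.
$decoded = base64_decode($compact, true);

if ($decoded === false) {
    echo "Not valid base64\n";
} else {
    echo $decoded, "\n";
}
```

Strict mode is worth using here, since it reports a capture that accidentally picked up non-base64 text instead of silently skipping it.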

Use

(?im)^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)

See proof

PHP:

preg_match_all('/^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)/im', $str, $matches);
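
With the default flags, preg_match_all() fills $matches[1] with the file names and $matches[2] with the encoded blocks, so several PDFs can be handled in one loop. A rough sketch with made-up input (boundary names and payloads are stand-ins). It assumes a blank line follows each payload; without one, as in the question's sample, .+ would also capture the boundary line that comes right after the data.

```php
<?php
// Made-up two-attachment input; the payloads are short stand-ins, not real PDFs.
$str = <<<'EOT'
--boundary-1
Content-Disposition: attachment;
filename="a.pdf"

SGVsbG8s
IFBERiE=

--boundary-1
Content-Disposition: attachment;
filename="b.pdf"

U2Vjb25k
IGZpbGU=

--boundary-1--
EOT;

$files = [];
if (preg_match_all('/^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)/im', $str, $matches)) {
    foreach ($matches[1] as $i => $name) {
        // Strip the line breaks, then decode the payload that belongs to this name.
        $files[$name] = base64_decode(preg_replace('/\s+/', '', $matches[2][$i]), true);
    }
}

var_export($files);
```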

Explanation

--------------------------------------------------------------------------------
  (?im)                    set flags for this block (case-
                           insensitive) (with ^ and $ matching start
                           and end of line) (with . not matching \n)
                           (matching whitespace and # normally)
--------------------------------------------------------------------------------
  ^                        the beginning of a "line"
--------------------------------------------------------------------------------
  filename="               'filename="'
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    pdf                      'pdf'
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  "                        '"'
--------------------------------------------------------------------------------
  \R+                      any line break sequence (1 or more times (matching 
                           the most  amount possible))
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    .+                       any character except \n (1 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (1 or more
                             times (matching the most amount
                             possible)):
--------------------------------------------------------------------------------
      \R                       any line break sequence
--------------------------------------------------------------------------------
      .+                       any character except \n (1 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )+                       end of grouping
--------------------------------------------------------------------------------
  )                        end of \2

I gather that this task is not about validation at all and focuses solely on data extraction, which makes sharpening the regex logic unnecessary.

You only need a pattern that matches filename=" at the start of a line, captures the quote-wrapped substring (so long as it ends in .pdf), and then, after any number of whitespace characters, captures all characters until one or two = are encountered.
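
One thing to keep in mind when stopping at =: a base64 payload whose byte length is a multiple of three carries no = padding, so [^=]+ keeps reading into whatever follows. A small sketch with made-up input (the boundary name and the QUJD payload are assumptions for illustration), plus one possible guard that limits the capture to base64 characters and whitespace:

```php
<?php
// Made-up input: "QUJD" decodes to "ABC" and needs no "=" padding.
$fileContents = <<<'EOT'
filename="a.pdf"

QUJD
--boundary
Content-Type: application/pdf;
name="b.pdf"
EOT;

// Stop-at-"=" pattern: the capture runs on to the "=" of the next header.
preg_match('~^filename="([^"]+)\.pdf"\s*([^=]+={1,2})~m', $fileContents, $loose);

// Possible guard: allow only base64 characters and whitespace, padding optional.
preg_match('~^filename="([^"]+)\.pdf"\s*([A-Za-z0-9+/\s]+={0,2})~m', $fileContents, $strict);

var_export([$loose[2], trim($strict[2])]);
```

If every payload in the file is known to end in padding, the simpler stop-at-= capture is enough.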

Using greedy negated character classes allows the regex engine to move quickly. The m pattern modifier tells the regex engine that the ^ metacharacter (not the ^ used inside square brackets) may match at the start of a line in addition to the start of the string.
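
To see the effect of the m modifier in isolation, here is a tiny sketch with a made-up two-line string:

```php
<?php
// Made-up text: the filename line is the second line, not the start of the string.
$text = "first line\nfilename=\"x.pdf\"\n";

// Without m, ^ only matches at the very start of the string.
$withoutM = preg_match('~^filename=~', $text);

// With m, ^ also matches right after each line break.
$withM = preg_match('~^filename=~m', $text);

var_dump($withoutM, $withM);
```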

Perhaps you'd like to generate an associative array where the filename strings are the keys and the encoded strings are the values; array_column() does a snappy job of setting that up when there are qualifying matches.

Code: (Demo)

var_export(
    preg_match_all(
        '~^filename="([^"]+)\.pdf"\s*([^=]+={1,2})~m',
        $fileContents,
        $out,
        PREG_SET_ORDER
    )
    ? array_column($out, 2, 1)
    : "no pdf's found"
);

Output:

array (
  '001235_0001' => 'JVBERi0xLjMNCjMgMCBvYmoNCjw8DQogIC9UeXBlIC9YT2JqZWN0DQogIC9TdWJ0eXBlIC9J
bWFnZQ0KICAvRmlsdGVyIC9EQ1REZWNvZGUNCiAgL1dpZHRoIDI0MDkNCiAgL0hlaWdodCAz
AF6UAFACZoAUUAFABQA1TQAuaADGKAFoASgBaACgBKADpTAQnApAJ0oAdQAdKAD2oAXpQA3p
.........................................
0oAU9KAFHFABQAnSgBOaAFoAKACgAoAWgAoATGOlAAKAFoATpQAYoAO9AC0AFACZ7UAGKAFo
ZPi1JZBodj7GEjdqgELTq0RC7xeSu1yv+dwEltQFPoSMGcbiTf0cGyzbreEAAAAAAAA=',
)
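
For reference, array_column($out, 2, 1) takes column 2 as the values and column 1 as the keys. A small sketch with made-up rows shaped like a PREG_SET_ORDER result; note that when two attachments share a name, the later one replaces the earlier one:

```php
<?php
// Made-up rows shaped like a PREG_SET_ORDER result:
// [0] full match, [1] file name without ".pdf", [2] encoded payload.
$out = [
    ['(full match 1)', 'invoice', 'QUJD'],
    ['(full match 2)', 'report',  'REVG'],
    ['(full match 3)', 'invoice', 'R0hJ'],
];

// Column 2 becomes the values, column 1 the keys.
$byName = array_column($out, 2, 1);

var_export($byName);
```

If duplicate names are possible in your file, looping over $out and collecting the payloads in a list per name would avoid losing any of them.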


 