
Javascript Regex multi-line base64

I have the following from a MIME message:

--------------ra650umTsDNeI5lwXmFy5luF
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: base64

TG9yZW0gSXBzdW0NCg0KSGVyZSBpcyBzb21lIG1vcmUgdGV4dA0KDQpOb3cgb24gYSAzcmQg
bGluZQ0KDQoNClRoYW5rcw0KDQo=

--------------ra650umTsDNeI5lwXmFy5luF--

I want to extract the base64 encoded message, regardless of how many lines it is.

The following regex does find a match on each individual line, but how can I group the matches so that consecutive lines of base64 are treated as one match?

var base64Regex = /^(?:[A-Za-z0-9+\/]{4})*(?:[A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}={2})$/gm

When the MIME content for example also contains a PGP signature, this would give me 4 or 5 matches, so I can't simply join them, because it will find that base64 as well.

Ideally I'd modify this so it gets everything from/including the first match to ---------- and says that is "match 1" and if it finds another block of base64, that is "match 2", etc.

Here is a link to regex101 showing 2 matches. In short, I would like for this to be one match.

https://regex101.com/r/32WjKa/1

Would this help?

var base64Regex = /Content-Transfer-Encoding: base64([\s\S]*?)\s*?--/g;

Content-Transfer-Encoding: base64 - This is the start of the base64 encoded message.

[\s\S]*? - This is the base64 encoded message. It can be on multiple lines.

\s*?-- - This is the end of the base64 encoded message: optional whitespace followed by the -- that starts the next boundary line.

g - This is the global flag, so that it will match all instances of the regex
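
To get each base64 block as its own entry ("match 1", "match 2", and so on), the regex above can be combined with String.prototype.matchAll. This is a minimal sketch rather than a full MIME parser: the helper name extractBase64Parts, the shortened boundary string, and the second part are made up for illustration, and it assumes Content-Transfer-Encoding is the last header of each part.

```javascript
// Hypothetical helper: returns one base64 string per encoded MIME part.
function extractBase64Parts(mime) {
  const re = /Content-Transfer-Encoding: base64([\s\S]*?)\s*?--/g;
  // Group 1 still contains the line breaks, so strip all whitespace.
  return Array.from(mime.matchAll(re), (m) => m[1].replace(/\s+/g, ''));
}

// Illustrative message with two base64 parts (boundary name is an assumption).
const message = [
  '--boundary',
  'Content-Type: text/plain; charset=UTF-8',
  'Content-Transfer-Encoding: base64',
  '',
  'TG9yZW0gSXBzdW0NCg0KSGVyZSBpcyBzb21lIG1vcmUgdGV4dA0KDQpOb3cgb24gYSAzcmQg',
  'bGluZQ0KDQoNClRoYW5rcw0KDQo=',
  '',
  '--boundary',
  'Content-Type: text/plain; charset=UTF-8',
  'Content-Transfer-Encoding: base64',
  '',
  'SGVsbG8=',
  '',
  '--boundary--',
].join('\n');

const parts = extractBase64Parts(message);
console.log(parts.length); // 2
console.log(parts[1]);     // SGVsbG8=
```

If another header follows Content-Transfer-Encoding within a part, that header would end up in the captured group, so the pattern would need tightening for such input.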

Instead of looking for base64 characters, I'd look for all characters (including newlines) between the start and end of the MIME part's payload.

By default, . in JavaScript regexes does not match line breaks, even in multi-line mode (the m flag). The s flag allows . to match line breaks.

With this method, you can remove the line breaks after matching with a simple replace():

const str = `--------------ra650umTsDNeI5lwXmFy5luF
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: base64

TG9yZW0gSXBzdW0NCg0KSGVyZSBpcyBzb21lIG1vcmUgdGV4dA0KDQpOb3cgb24gYSAzcmQg
bGluZQ0KDQoNClRoYW5rcw0KDQo=

--------------ra650umTsDNeI5lwXmFy5luF--`

const payload = str.match(/base64\n\n(.+)\n\n--------------.+/ms)[1].replace(/\n/g, '')
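
MIME messages read from a file or socket often use CRLF (\r\n) line endings, in which case the literal \n\n in the pattern above would not match. The following is a sketch of a more tolerant variant under that assumption. The helper name extractPayload is hypothetical, and the decoding step uses Buffer, so it assumes Node.js.

```javascript
// Hypothetical helper: like the one-liner above, but accepts \n or \r\n
// line endings and returns null instead of throwing when nothing matches.
function extractPayload(str) {
  const m = str.match(/base64\r?\n\r?\n(.+?)\r?\n\r?\n--/s);
  return m ? m[1].replace(/\s+/g, '') : null;
}

// The sample message from the question, joined with CRLF line endings.
const crlfMessage = [
  '--------------ra650umTsDNeI5lwXmFy5luF',
  'Content-Type: text/plain; charset=UTF-8; format=flowed',
  'Content-Transfer-Encoding: base64',
  '',
  'TG9yZW0gSXBzdW0NCg0KSGVyZSBpcyBzb21lIG1vcmUgdGV4dA0KDQpOb3cgb24gYSAzcmQg',
  'bGluZQ0KDQoNClRoYW5rcw0KDQo=',
  '',
  '--------------ra650umTsDNeI5lwXmFy5luF--',
].join('\r\n');

const payloadB64 = extractPayload(crlfMessage);

// Decode the base64 text (Node.js only; Buffer is not available in browsers).
const decoded = Buffer.from(payloadB64, 'base64').toString('utf8');
console.log(decoded);
```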

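You might also be better off using a dedicated MIME parser rather than a regex, since messages like this follow a standard format. (Note that body-parser parses HTTP request bodies for Express and does not handle multipart MIME.)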

Here are two solutions: one using .replace() with a regex, the other using .match() with a positive lookbehind and a positive lookahead:

const input = `--------------ra650umTsDNeI5lwXmFy5luF
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: base64

TG9yZW0gSXBzdW0NCg0KSGVyZSBpcyBzb21lIG1vcmUgdGV4dA0KDQpOb3cgb24gYSAzcmQg
bGluZQ0KDQoNClRoYW5rcw0KDQo=

--------------ra650umTsDNeI5lwXmFy5luF--`;

const regex1 = /^.*?Content-Transfer-Encoding: base64\s+(.*?)\s*---.*$/is;
let result1 = input.replace(regex1, '$1');
console.log(result1);

const regex2 = /(?<=Content-Transfer-Encoding: base64\s+).*?(?=\s*---)/is;
let result2 = input.match(regex2);
console.log(result2[0]);

Output:

TG9yZW0gSXBzdW0NCg0KSGVyZSBpcyBzb21lIG1vcmUgdGV4dA0KDQpOb3cgb24gYSAzcmQg
bGluZQ0KDQoNClRoYW5rcw0KDQo=

TG9yZW0gSXBzdW0NCg0KSGVyZSBpcyBzb21lIG1vcmUgdGV4dA0KDQpOb3cgb24gYSAzcmQg
bGluZQ0KDQoNClRoYW5rcw0KDQo=

Explanation of regex 1 for .replace():

  • ^ -- anchor at start of string
  • .*?Content-Transfer-Encoding: base64\s+ -- literal text up to and including base64, plus the whitespace that follows
  • (.*?) -- capture group one: non-greedy capture of everything, until:
  • \s*---.* -- whitespace, ---, and everything after that
  • $ -- anchor at end of string
  • use the i and s flags for case-insensitive matching and to let . match newlines, respectively

Explanation of regex 2 for .match():

  • (?<=Content-Transfer-Encoding: base64\s+) -- positive lookbehind for the literal text ...base64, including the whitespace that follows
  • .*? -- non-greedy scan, until:
  • (?=\s*---) -- positive lookahead for whitespace and ---
  • use the i and s flags for case-insensitive matching and to let . match newlines, respectively

Notes:

  • Keep in mind that not all regex flavors and browsers support lookbehind; notably, Safari only added support in version 16.4
  • It is safe to scan for --- to find the end because dashes are not part of the standard base64 alphabet
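
As a further alternative that avoids lookbehind and keeps the stricter per-line validation from the question, consecutive matching lines can be grouped in a loop. This is only a sketch: the helper name groupBase64Blocks is hypothetical, and the per-line regex is the one from the question with the g and m flags removed, since each line is tested on its own.

```javascript
// Per-line validator from the question (no g or m flag needed here).
const base64Line = /^(?:[A-Za-z0-9+\/]{4})*(?:[A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}={2})$/;

// Hypothetical helper: joins each run of consecutive base64-looking lines
// into one string, so every block becomes a single "match".
function groupBase64Blocks(text) {
  const blocks = [];
  let current = [];
  for (const line of text.split(/\r?\n/)) {
    if (base64Line.test(line)) {
      current.push(line);
    } else if (current.length > 0) {
      // A non-matching line (blank line, header, boundary) closes the block.
      blocks.push(current.join(''));
      current = [];
    }
  }
  if (current.length > 0) blocks.push(current.join(''));
  return blocks;
}

const sample = `--------------ra650umTsDNeI5lwXmFy5luF
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: base64

TG9yZW0gSXBzdW0NCg0KSGVyZSBpcyBzb21lIG1vcmUgdGV4dA0KDQpOb3cgb24gYSAzcmQg
bGluZQ0KDQoNClRoYW5rcw0KDQo=

--------------ra650umTsDNeI5lwXmFy5luF--`;

const blocks = groupBase64Blocks(sample);
console.log(blocks.length); // 1
```

One caveat with this approach: any plain-text line that consists only of base64 characters and has a length that is a multiple of four (for example a single four-letter word) would also be reported as a block, so the results may need filtering.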
