
Perl regex matching too broadly

I have strings taken from Linux mail logs that look something like:

May 20 12:19:28 example-03 amavis[1445]: (01445-15) Passed SPAMMY {RelayedTaggedInbound}, [10.4.3.2]:49488 [10.4.3.2] <offers-john=example.com@example.net> -> <john@example.com>, Queue-ID: C00OZs0w9DB, Message-ID: <5ZCfDBMQyiUjOVD78ZFxg5%3D%3D@example.net>, mail_id: aCUpU0wtUaR, Hits: 15.587, size: 21407, queued_as: dgzikuucQ9i, 438 ms

The element I need to extract is :

<offers-john=example.com@example.net> -> <john@example.com>

I want to keep my regex as simple and clear as possible, so I don't want to go into regexes for email address formats, not least because matching email formats with a regex is a bug-prone process!

I have tried :

$row =~ /(<.*> -> <.*>,)/;

But, despite the presence of the comma delimiter, that pattern matches all the way to the end of the Message-ID, with an output such as:

<offers-john=example.com@example.net> -> <john@example.com>, Queue-ID: C00OZs0w9DB, Message-ID: <5ZCfDBMQyiUjOVD78ZFxg5%3D%3D@example.net>,

You need to make the quantifiers non-greedy by adding ? after each * in your regex:

(<.*?> -> <.*?>)

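By default the quantifier * is greedy: it matches as much as it can and only gives characters back when the rest of the pattern would otherwise fail. In your regex the second .* runs to the end of the line and then backtracks to the last ">," it can find, which is the one after the Message-ID. Adding a ? after the * makes it lazy (aka non-greedy), so it stops at the first place where the rest of the pattern can match.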
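To see the difference side by side, here is a minimal sketch that runs both patterns against the log line from the question. The variable names $greedy and $lazy are only for illustration.

```perl
use strict;
use warnings;

# Sample log line copied from the question (single quotes, so @ is not interpolated).
my $row = 'May 20 12:19:28 example-03 amavis[1445]: (01445-15) Passed SPAMMY {RelayedTaggedInbound}, [10.4.3.2]:49488 [10.4.3.2] <offers-john=example.com@example.net> -> <john@example.com>, Queue-ID: C00OZs0w9DB, Message-ID: <5ZCfDBMQyiUjOVD78ZFxg5%3D%3D@example.net>, mail_id: aCUpU0wtUaR, Hits: 15.587, size: 21407, queued_as: dgzikuucQ9i, 438 ms';

# Greedy: the second .* backtracks only as far as the LAST ">," on the line.
my ($greedy) = $row =~ /(<.*> -> <.*>,)/;

# Lazy: each .*? stops at the first point where the rest of the pattern matches.
my ($lazy) = $row =~ /(<.*?> -> <.*?>)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy line should reproduce the over-long output shown in the question, while the lazy line should stop right after <john@example.com>.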

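This is more robust when written without non-greedy quantifiers. A negated character class such as [^<>]* can never cross an angle bracket, so each <...> part is guaranteed to stay inside a single pair of brackets. It is also clearer if insignificant whitespace is added with the help of the /x modifier, like so: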

$row =~ / ( <[^<>]*> \s* -> \s* <[^<>]*> ) /x;
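As a rough illustration of why the character class is the safer choice, the sketch below compares it with the lazy pattern on a made-up line that has an extra <...> token before the addresses. The line in $tricky and the variable names are hypothetical, not taken from real amavis output.

```perl
use strict;
use warnings;

# Hypothetical input: an unrelated <tag> appears before the address pair.
my $tricky = 'note <tag> then <a@example.com> -> <b@example.com>, rest';

# Lazy quantifiers still start at the first "<" and stretch until " -> " is found.
my ($lazy) = $tricky =~ /(<.*?> -> <.*?>)/;

# The negated class cannot step over "<" or ">", so it skips the stray <tag>.
my ($strict) = $tricky =~ / ( <[^<>]*> \s* -> \s* <[^<>]*> ) /x;

# The same idea with separate captures for the two addresses.
my ($sender, $recipient) = $tricky =~ / < ([^<>]*) > \s* -> \s* < ([^<>]*) > /x;

print "lazy:   $lazy\n";
print "strict: $strict\n";
print "from $sender to $recipient\n" if defined $sender;
```

On this input the lazy pattern should drag <tag> into the match, whereas the character-class version should return only the two bracketed addresses.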
