Regular expression task in Perl

I have a simple question about a regex problem.

Given the following example string:

Apr 2 13:42:32 sandbox izxp[12000]: Received disconnect from 10.11.106.14: 10: disconnected by user

I need to split this string into 4 separate strings: the date (Apr 2), the time (13:42:32), the server name (sandbox), and the remaining data (izxp[12000]: Received disconnect from 10.11.106.14: 10: disconnected by user).

These will then be stored in variables.

I would be very happy if someone could help me out!

Thx!

It's a little easier to use split for this task.

my ($date1, $date2, $time, $host, $data) = split(' ', $str, 5);
my $date = "$date1 $date2";
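As a self-contained sketch of the split approach, here it is applied to the sample line from the question (the `$month`/`$day` names and the `print` line are only for illustration). The special pattern `' '` splits on runs of whitespace, so a padded day such as `Apr  2` still yields one field, and the limit of 5 leaves the rest of the line intact in `$data`.

```perl
use strict;
use warnings;

my $str = 'Apr 2 13:42:32 sandbox izxp[12000]: Received disconnect from 10.11.106.14: 10: disconnected by user';

# Limit of 5: month, day, time, host, then everything else untouched.
my ($month, $day, $time, $host, $data) = split ' ', $str, 5;
my $date = "$month $day";

print join("\n", $date, $time, $host, $data), "\n";
```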

I always use what I call "scan patterns" for this type of thing. The format for the date is pretty easy:

/((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d+)/

The expression for the time isn't much harder:

/(\d\d:\d\d:\d\d)/

Once you've got that out of the way, I think it's easy enough to specify the server like so:

/(\w+)/

The next part is just everything else, so the patterns can be concatenated as:

/((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d+)\s+(\d\d:\d\d:\d\d)\s+(\w+)\s+(.*)/

And you can store that data in Perl variables with this expression:

my ( $date, $time, $host, $desc ) 
    = $str =~ m/((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d+)
                \s+(\d\d:\d\d:\d\d)\s+(\w+)\s+(.*)
               /x
    ;
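One way to keep the pieces readable is to precompile each one with `qr//` and interpolate them into the final match. The following is only a sketch: `parse_line` and the `*_re` variable names are hypothetical, not part of the answer above. It also shows checking for a failed match, since a match in list context returns an empty list when the line does not fit.

```perl
use strict;
use warnings;

# The three building blocks from above, precompiled with qr//.
my $date_re = qr/(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d+/;
my $time_re = qr/\d\d:\d\d:\d\d/;
my $host_re = qr/\w+/;

# parse_line is a hypothetical helper name used for illustration.
# Call it in list context: it returns (date, time, host, rest) on
# success and an empty list when the line does not match.
sub parse_line {
    my ($line) = @_;
    return $line =~ /($date_re)\s+($time_re)\s+($host_re)\s+(.*)/;
}

my $str = 'Apr 2 13:42:32 sandbox izxp[12000]: Received disconnect from 10.11.106.14: 10: disconnected by user';

if ( my ($date, $time, $host, $desc) = parse_line($str) ) {
    print "date=$date time=$time host=$host desc=$desc\n";
}
else {
    warn "unparsed line: $str\n";
}
```

Note that `\w+` does not match `-` or `.`, so a host name such as `web-01` would make the whole match fail; `\S+` may be a safer choice if such names can occur.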

Note that a performance comparison between split and precompiled regexps (in the output below, re1 is Axeman's pattern and re2 is a simplified /(\S+ \S+) (\S+) (\S+) (.*)/) confirms that split wins, but the difference is tiny: you won't even notice it when parsing fewer than a million lines. Axeman's regexp could be tightened further to help you validate your input, which is very important.

Comparison over 10 million iterations:

        Rate  re1  re2  spl
re1 250000/s   -- -28% -57%
re2 344828/s  38%   -- -41%
spl 584795/s 134%  70%   --

Here is the rundown for 10 million calls (n=10000000) on an ancient Core Duo:

re1: 40 wallclock secs (39.84usr+0.00sys=39.84CPU) @ 251004.02/s (n=10000000)
re2: 29 wallclock secs (29.04usr+0.01sys=29.05CPU) @ 344234.08/s (n=10000000)
spl: 18 wallclock secs (16.77usr+0.00sys=16.77CPU) @ 596302.92/s (n=10000000)
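The benchmark script itself is not shown. A comparison of this kind could be produced with the core `Benchmark` module, roughly as sketched below; the iteration count and the exact shape of each test sub are assumptions, so the numbers it prints will differ from those above.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

my $str = 'Apr 2 13:42:32 sandbox izxp[12000]: Received disconnect from 10.11.106.14: 10: disconnected by user';

# re1: the explicit month/time pattern; re2: the simplified \S+ pattern.
my $re1 = qr/((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d+)\s+(\d\d:\d\d:\d\d)\s+(\w+)\s+(.*)/;
my $re2 = qr/(\S+ \S+) (\S+) (\S+) (.*)/;

# Each sub parses the same line and returns the number of fields it produced.
my %tests = (
    re1 => sub { my @f = $str =~ $re1; return scalar @f },
    re2 => sub { my @f = $str =~ $re2; return scalar @f },
    spl => sub {
        my ($m, $d, @rest) = split ' ', $str, 5;
        my @f = ("$m $d", @rest);
        return scalar @f;
    },
);

# Assumed iteration count; raise it for more stable numbers.
my $count = 1_000_000;

# timethese prints the per-test timings; cmpthese prints the rate table.
cmpthese( timethese( $count, \%tests ) );
```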

The difference looks significant at this number of records. But if you are going to validate, say, the rest of the data string somewhere anyway, you are better off checking it once, at the parsing stage.
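As a rough illustration of validating while parsing, the pattern can be anchored and followed by a few range checks. This is only a sketch: `parse_strict` is a hypothetical name, and the checks are assumptions about what counts as valid for these log lines.

```perl
use strict;
use warnings;

# parse_strict is a hypothetical helper; the rules below are assumptions
# about what a valid line looks like. Call it in list context: it returns
# (date, time, host, rest) or an empty list if the line is rejected.
sub parse_strict {
    my ($line) = @_;

    my ($mon, $day, $hh, $mm, $ss, $host, $rest) = $line =~ m/
        ^ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
        \s+ (\d{1,2})
        \s+ (\d\d):(\d\d):(\d\d)
        \s+ (\S+)
        \s+ (.*)$
    /x or return;

    # Range checks that the pattern alone does not enforce.
    return if $day < 1 || $day > 31;
    return if $hh > 23 || $mm > 59 || $ss > 59;

    return ( "$mon $day", join(':', $hh, $mm, $ss), $host, $rest );
}

my @fields = parse_strict('Apr 2 13:42:32 sandbox izxp[12000]: Received disconnect from 10.11.106.14: 10: disconnected by user');
print @fields ? join(' | ', @fields) . "\n" : "rejected\n";
```

Rejected lines can then be logged or counted in one place instead of being re-checked later.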
