
regex with variable part

How can I merge these two regexes into a single regex that captures all available parts, depending on the string structure? The last three fields in $s are optional and should be captured if they exist. Using (?= ... ) I could not get a working solution.

$s='1.2.3.4 - egon  [10/Dec/2007:21:07:20 +0100] "GET /x.htm HTTP/1.1" 401 488';
$re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    [ ] (\S+)
    [ ]+ \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(\S+) [ ] (.*?) [ ] (\S+)"
    [ ] (\S+)
    [ ] (\S+)
    \Z/x;
print "[".join('],[',$s =~ $re)."]\n\n";   

$s='1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283 "-" "Mozilla/5.0..." "-"';
$re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    [ ] (\S+)
    [ ]+ \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(\S+) [ ] (.*?) [ ] (\S+)"
    [ ] (\S+)
    [ ] (\S+) [ ] "(.*?)" [ ] "(.*?)" [ ] "(.*?)"
        \Z
        /x;
print "[".join('],[',$s =~ $re)."]\n\n";   

When your regexes start looking like that, I think it's a good idea to start thinking about alternatives. In this case, you might try Text::ParseWords, since your strings are more or less delimited and contain quoted strings. It is a core module in Perl 5.

Basically, we supply a regex for the delimiters we expect, a 0 or 1 that says whether to keep the quotes, and the input lines themselves.

use strict;
use warnings;
use Text::ParseWords;

my $s = '1.2.3.4 - egon  [10/Dec/2007:21:07:20 +0100] "GET /x.htm HTTP/1.1" 401 488';
my @s = quotewords('[\s/:\[\].]+', 0, $s);
print "[".join('],[',@s)."]\n\n";   

$s = '1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283 "-" "Mozilla/5.0..." "-"';
@s = quotewords('[\s/:\[\].]+', 0, $s);
print "[".join('],[',@s)."]\n\n";   

Output:

[1],[2],[3],[4],[-],[egon],[10],[Dec],[2007],[21],[07],[20],[+0100],[GET /x.htm HTTP/1.1],[401],[488]

[1],[2],[3],[4],[-],[-],[13],[Jun],[2007],[01],[37],[44],[+0200],[GET /x.htm HTTP/1.0],[404],[283],[-],[Mozilla/5.0...],[-]
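
To illustrate what the second argument does, here is a minimal sketch that runs quotewords twice on the tail of one log line, once with keep set to 0 and once with 1. The shortened input and the plain whitespace delimiter are chosen only for illustration.

```perl
use strict;
use warnings;
use Text::ParseWords;

# Tail of a log line, shortened for the example
my $line = '404 283 "-" "Mozilla/5.0..." "-"';

# keep = 0: the surrounding quotes are removed from the tokens
my @stripped = quotewords('\s+', 0, $line);

# keep = 1: the quotes stay part of the tokens
my @kept = quotewords('\s+', 1, $line);

print join('|', @stripped), "\n";   # 404|283|-|Mozilla/5.0...|-
print join('|', @kept), "\n";       # 404|283|"-"|"Mozilla/5.0..."|"-"
```

With keep set to 1 you can still tell which fields were quoted in the input, which may matter if a quoted "-" and a bare - mean different things to you.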

Instead of using a lookahead (?=), you can use a non-capturing group (?:) and match zero or one occurrence:

$re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    [ ] (\S+)
    [ ]+ \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(\S+) [ ] (.*?) [ ] (\S+)"
    [ ] (\S+)
    [ ] (\S+)
    (?:
        [ ] "(.*?)"
        [ ] "(.*?)"
        [ ] "(.*?)"
    )?
    \Z/x;

This will yield a fixed-length array of captures, but the last 3 will be undef if the optional group does not match. If you have to match between 1 and 3 optional fields, wrap each in its own non-capturing group with zero or one ( ? ) occurrences. I also tried this, but it doesn't work:

(?: [ ] "(.*?)" ){0,3} \Z

It matches, and consumes each of the last three fields, but the group contains only one capture slot, so each repetition overwrites the previous value. After the match is done, that slot holds just the final field.
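
The difference between the repeated group and one optional group per field can be seen in a small sketch. The test strings and variable names here are made up for illustration; only the two group patterns come from the discussion above.

```perl
use strict;
use warnings;

my $tail = ' "-" "Mozilla/5.0..." "ref"';

# Repeated group: a single capture slot, overwritten on every repetition
my @repeated = $tail =~ /\A (?: [ ] "(.*?)" ){0,3} \z/x;

# One optional group per field: three slots, unmatched ones stay undef
my $each = qr/\A
    (?: [ ] "(.*?)" )?
    (?: [ ] "(.*?)" )?
    (?: [ ] "(.*?)" )?
    \z/x;
my @all  = $tail =~ $each;
my @some = ' "-"' =~ $each;

# Show undef explicitly so the fixed number of slots is visible
for my $caps (\@repeated, \@all, \@some) {
    print join(',', map { defined $_ ? "[$_]" : 'undef' } @$caps), "\n";
}
```

The first list has a single element (the last field), while the other two always have three elements, with undef in the slots that did not take part in the match.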

I would caution you that you are using a very strict expression that may not be suited to all web logs: specifically, the match for the IP address will not handle IPv6 addresses, and the match for the user agent may not handle user agents containing " characters, depending on how they are escaped (lighttpd 1.4.28 does not escape them, for instance).
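
If your server does escape embedded quotes with a backslash (an assumption you would need to verify against your own logs), a quoted-field pattern that understands such escapes could look like the sketch below. The sample user agent is invented for the example, and nothing here helps with servers that write unescaped quotes.

```perl
use strict;
use warnings;

# A quoted field whose content may contain backslash escapes such as \"
my $quoted = qr/"((?:[^"\\]|\\.)*)"/;

# Invented sample: the user agent contains two escaped quotes
my $tail = ' "-" "Mozilla/5.0 (\"odd\" agent)" "-"';

my @fields = $tail =~ /\A [ ] $quoted [ ] $quoted [ ] $quoted \z/x;

print "[$_]\n" for @fields;
```

The captured text still contains the backslashes; removing them is a separate step once you know which escaping rules your server applies.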

I did not mean to talk down any of the suggested solutions.

As I said before: nice idea. But it only does what the package name promises: ParseWords.

"Find me a test case where your regex works and my solution fails if you want to continue this discussion ...".

Of course I have tested your solution for my purposes.

In your solution the fields are shifted around, depending on the input.

With the regex I always find the fields at defined positions.

(for example: Authuser at $token[5] and Year at $token[9])

Here is the test:

#!/usr/bin/perl -w
use strict;
use warnings;
use FileHandle;
use Text::ParseWords;

my $re = qr/\A
        (\d+)\.(\d+)\.(\d+)\.(\d+)
    [ ] (\S+)
    (?: [ ] (\S*))? (?: [ ] (\S*))?
    [ ] \[(\d+)\/(\S+)\/(\d+):(\d+):(\d+):(\d+) [ ] (\S+)\]
    [ ] "(?:(\S+) [ ])? (.*?) (?:[ ] (\S+))?"
    [ ] (\S+)
    [ ] (\S+)
    (?:
        [ ] "(.*?)"
        [ ] "(.*?)"
        [ ] "(.*?)"
    )?
    \Z/x;

my (@s,@token);
#---- most entries ------------------------------------------------------------
push(@s,'1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283');
#---- referer, user agent, ... ------------------------------------------------
push(@s,'1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283 "-" "Mozilla/5.0..." "-"');
#---- auth without password ---------------------------------------------------
push(@s,'1.2.3.4 - ausr  [10/Dec/2007:21:07:20 +0100] "GET /x.htm HTTP/1.1" 401 488');
#---- no http request --------------------------------------------------------- 
push(@s,'1.2.3.4 - - [13/Jun/2007:19:16:18 +0200] "-" 408 -');
#---- auth with password ------------------------------------------------------
push(@s,'1.2.3.4 - ausr pwd [12/Jul/2006:16:55:04 +0200] "GET /x.htm HTTP/1.1" 401 489');
#---- auth without user -------------------------------------------------------
push(@s,'1.2.3.4 -  pwd [16/Aug/2007:08:43:50 +0200] "GET /x.htm HTTP/1.1" 401 489');
#---- multiple words in request -----------------------------------------------
push(@s,'1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /this is test HTTP/1.0" 404 283'); 

no warnings 'uninitialized';
foreach(@s)
{ @token=$_ =~ $re;
  print "regex:      AUTHUSER=".$token[5].", YEAR=".$token[9]."\n";
  @token=quotewords('[\s/:\[\].]+', 0, $_);
  print "quotewords: AUTHUSER=".$token[5].", YEAR=".$token[9]."\n\n";
}

And here are the results:

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=01

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=01

regex:      AUTHUSER=ausr, YEAR=2007
quotewords: AUTHUSER=ausr, YEAR=21

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=19

regex:      AUTHUSER=ausr, YEAR=2006
quotewords: AUTHUSER=ausr, YEAR=2006

regex:      AUTHUSER=, YEAR=2007
quotewords: AUTHUSER=pwd, YEAR=08

regex:      AUTHUSER=-, YEAR=2007
quotewords: AUTHUSER=-, YEAR=01
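
Since the point of the comparison is that fields should be found at stable positions, named captures (available from Perl 5.10) are one way to avoid counting indexes at all. The following is only a sketch: it covers the front part of the test regex above, and the group names (ip, ident, authuser, password, day, month, year) are labels chosen for this example.

```perl
use strict;
use warnings;
use 5.010;    # named captures and %+ need Perl 5.10 or later

# Front part of the log line only; the rest of the line is not matched here
my $re = qr/\A
    (?<ip> \d+\.\d+\.\d+\.\d+ )
    [ ] (?<ident> \S+ )
    (?: [ ] (?<authuser> \S* ) )? (?: [ ] (?<password> \S* ) )?
    [ ] \[ (?<day>\d+) \/ (?<month>\S+) \/ (?<year>\d+) :
/x;

my @lines = (
    '1.2.3.4 - - [13/Jun/2007:01:37:44 +0200] "GET /x.htm HTTP/1.0" 404 283',
    '1.2.3.4 - ausr pwd [12/Jul/2006:16:55:04 +0200] "GET /x.htm HTTP/1.1" 401 489',
    '1.2.3.4 -  pwd [16/Aug/2007:08:43:50 +0200] "GET /x.htm HTTP/1.1" 401 489',
);

my @rows;
for my $line (@lines) {
    next unless $line =~ $re;
    # Copy the values out of %+ before the next match overwrites them
    push @rows, {
        authuser => $+{authuser},
        password => $+{password},
        year     => $+{year},
    };
}

for my $row (@rows) {
    no warnings 'uninitialized';
    print "AUTHUSER=$row->{authuser}, PASSWORD=$row->{password}, YEAR=$row->{year}\n";
}
```

Each field is then looked up by name, so adding or removing an optional group does not shift the others.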
