Regex issue when the log format has multiple IPs

I have an issue with the Fluentd log parser. The following config works fine when there are 2 IPs.

expression  /^(?<client_ip>[^ ]*)(?:, (?<lb_ip>[^ ]*))? (?<ident>[^ ]*) (?<user>[^ ]*) \[(?<time>[^ ]* [^ ]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) (?<protocol>[A-Z]{1,}[^ ]*)+\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)/

This matches:

148.165.41.129, 10.25.1.120 - - [09/Dec/2019:16:22:23 +0000] "GET /comet_request/44109669162/F1551019433002Y5MYEP?F155101943300742PMLG=1551019433877&_=1575904426457 HTTP/1.1" 200 0 0 0

When there are 3 IPs, I get a "pattern not match" warning.

This doesn't match:

176.30.235.70, 165.225.70.200, 10.25.1.120 - - [09/Dec/2019:13:30:57 +0000] \"GET /comet_request/71142769981/F1551018730440IY5YNF?F1551018721447ZVKYZ4=1551018733078&_=1575898029473 HTTP/1.1\" 200 0 0 0

I tried the following regex, but it doesn't work. Can someone please help?

expression /^(?<client_ip>[^ ]*)(?:, (?<proxy_ip>[^ ]*))? (?:, (?<lb_ip>[^ ]*))? (?<ident>[^ ]*) (?<user>[^ ]*) \[(?<time>[^ ]* [^ ]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) (?<protocol>[A-Z]{1,}[^ ]*)+\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)$/

You need to match the IPs with a more specific pattern, like [\d.]+ or [^, ]+, and make sure you also match the last two fields (you are not matching them, and $ requires the end of the line/string).
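A minimal Ruby sketch of both points, using fragments of the question's failing log line. The constant names are made up for illustration, and the two short tail patterns are simplified stand-ins for the end of the full expression, not part of the original config.

```ruby
# Fragments of the three-IP log line from the question.
IP_LIST = "176.30.235.70, 165.225.70.200, 10.25.1.120"
TAIL    = "200 0 0 0"

# [^ ]* also accepts a comma, so the first "IP" swallows it.
GREEDY_FIRST = IP_LIST[/\A[^ ]*/]
# [^ ,]+ stops before the comma.
STRICT_FIRST = IP_LIST[/\A[^ ,]+/]

# With $ right after size, the two trailing fields make the match fail.
SHORT_TAIL = /\A(?<code>\S+) (?<size>\S+)$/
# Consuming the two extra fields lets $ succeed.
FULL_TAIL  = /\A(?<code>\S+) (?<size>\S+) \S+ \S+$/

puts GREEDY_FIRST.inspect       # "176.30.235.70,"
puts STRICT_FIRST.inspect       # "176.30.235.70"
puts SHORT_TAIL.match?(TAIL)    # false
puts FULL_TAIL.match?(TAIL)     # true
```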

Use a pattern like

^(?<client_ip>[^ ,]+)(?:, +(?<proxy_ip>[^ ,]+))?(?:, +(?<lb_ip>[^ ,]+))? (?<ident>[^ ]+) (?<user>[^ ]+) \[(?<time>[^\]\[ ]* [^\]\[ ]*)\] "(?<method>\S+)(?: +(?<path>\S+) (?<protocol>[A-Z][^" ]*)[^"]*)?" (?<code>\S+) (?<size>\S+) \S+ \S+$

See the regex demo.

The IP matching part is ^(?<client_ip>[^ ,]+)(?:, +(?<proxy_ip>[^ ,]+))?(?:, +(?<lb_ip>[^ ,]+))?. Here [^ ,]+ matches one or more characters other than a space and a comma. \S+ \S+ is added at the end of the pattern to consume the last two fields (if these are numbers, you may use \d+ \d+ and capture them if needed).
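To check the pattern outside Fluentd, it can be tried directly in Ruby. This is only a sketch: the request paths in the sample lines are shortened versions of the ones in the question, and LOG_PATTERN, TWO_IPS, THREE_IPS and parse_line are illustrative names, not anything Fluentd defines.

```ruby
# The suggested pattern, written as a Ruby Regexp literal.
LOG_PATTERN = /^(?<client_ip>[^ ,]+)(?:, +(?<proxy_ip>[^ ,]+))?(?:, +(?<lb_ip>[^ ,]+))? (?<ident>[^ ]+) (?<user>[^ ]+) \[(?<time>[^\]\[ ]* [^\]\[ ]*)\] "(?<method>\S+)(?: +(?<path>\S+) (?<protocol>[A-Z][^" ]*)[^"]*)?" (?<code>\S+) (?<size>\S+) \S+ \S+$/

# Sample lines from the question, with the request paths shortened.
TWO_IPS   = '148.165.41.129, 10.25.1.120 - - [09/Dec/2019:16:22:23 +0000] "GET /comet_request/44109669162 HTTP/1.1" 200 0 0 0'
THREE_IPS = '176.30.235.70, 165.225.70.200, 10.25.1.120 - - [09/Dec/2019:13:30:57 +0000] "GET /comet_request/71142769981 HTTP/1.1" 200 0 0 0'

# Returns a hash of named captures, or nil when the line does not match.
def parse_line(line)
  match = LOG_PATTERN.match(line)
  match && match.named_captures
end

[TWO_IPS, THREE_IPS].each do |line|
  fields = parse_line(line)
  puts fields.values_at("client_ip", "proxy_ip", "lb_ip", "code").inspect
end
```

Note that with only two addresses the second one lands in proxy_ip and lb_ip stays nil, because the optional groups fill from left to right. If the last address should always be treated as the load balancer, the group names would need to be arranged differently.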

Example strings

Let's consider an abbreviated version of your question, focusing on the first four named capture groups (dealing with the remaining ones is straightforward).

str1 = "148.165.41.129, 10.25.1.120 - - [09/Dec/2019:16:22:23 +0000]"

str2 = "176.30.235.70, 165.225.70.200, 10.25.1.120 - - [09/Dec/2019:13:30:57 +0000]"

The regular expression written in free-spacing mode

The following regular expression can be used to extract the contents of the named capture groups, provided the string has a valid structure. Notice that it requires the IPv4 addresses and the date-time string to have the expected shape (rather than merely matching [^ ]+ and [^ ]+ [^ ]+). I've written the regular expression in free-spacing mode to make it self-documenting.

r = /
    \A              # match the beginning of the string 
    (?<client_ip>   # begin a capture group named client_ip
      \g<user_ip>   # evaluate the subexpression (capture group) named user_ip
    )               # end capture group client_ip
    (?:             # begin a non-capture group
      ,[ ]          # match the string ', '
      (?<lb_ip>     # begin a capture group named lb_ip
        \g<user_ip> # evaluate the subexpression (capture group) named user_ip
      )             # end capture group lb_ip
    )?              # end non-capture group and optionally execute it
    (?:             # begin a non-capture group
      ,[ ]          # match the string ', '
      (?<user_ip>   # begin a capture group named user_ip
        \d{1,3}     # match 1-3 digits 
        (?:         # begin a non-capture group
          \.\d{1,3} # match a period followed by 1-3 digits
        ){3}        # end the non-capture group and execute 3 times
      )             # end capture group user_ip
    )               # end non-capture group
    [ ]-[ ]-[ ]\[   # match the string ' - - ['
    (?<time>        # begin a capture group named time 
      \d{2}\/\p{L}{3}\/\d{4}:\d{2}:\d{2}:\d{2}[ ]\+\d{4}
                    # match a time string
    )               # end capture group time                    
    \]              # match string ']'
    \z              # match end of string
    /x              # free-spacing regex definition mode

Match the strings against the regular expression

We now confirm that the two strings match this regular expression and extract the contents of the capture groups.

    m1 = str1.match(r)
    m1.named_captures
      #=> {"client_ip"=>"148.165.41.129",
      #    "lb_ip"=>nil,
      #    "user_ip"=>"10.25.1.120",
      #    "time"=>"09/Dec/2019:16:22:23 +0000"} 

    m2 = str2.match(r)
    m2.named_captures
      #=> {"client_ip"=>"176.30.235.70",
      #    "lb_ip"=>"165.225.70.200",
      #    "user_ip"=>"10.25.1.120",
      #    "time"=>"09/Dec/2019:13:30:57 +0000"}

Subexpression Calls

Rather than replicating the content of the capture group user_ip for each of the first two named capture groups, I have simply used \g<user_ip>, which, in effect, tells the regex engine to evaluate the contents of the capture group (subexpression) user_ip at the location where \g<user_ip> is referenced. Search for "Subexpression Calls" in the docs for Regexp.
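As a small illustration of how a subexpression call differs from a backreference, consider the following sketch. The group name octet and the constants CALL and BACKREF are chosen here for the example only.

```ruby
# Subexpression call: \g<octet> re-runs the pattern \d{1,3} at that position.
CALL    = /\A(?<octet>\d{1,3})\.\g<octet>\z/
# Backreference: \k<octet> must match the same text that octet captured.
BACKREF = /\A(?<octet>\d{1,3})\.\k<octet>\z/

puts CALL.match?("10.25")     # true:  "25" satisfies \d{1,3}
puts BACKREF.match?("10.25")  # false: "25" is not the captured text "10"
puts BACKREF.match?("10.10")  # true:  the captured text repeats
```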

Notice that the subexpression calls are forward-looking: they refer to a group, user_ip, that is defined later in the expression. Suppose we instead wrote:

r = /
    \A 
    (?<client_ip>\d{1,3}(?:\.\d{1,3}){3})
    (?:,[ ](?<lb_ip>\g<client_ip>))?
    (?:,[ ](?<user_ip>\g<client_ip>))
    [ ]-[ ]-[ ]\[
    (?<time>\d{2}\/\p{L}{3}\/\d{4}:\d{2}:\d{2}:\d{2}[ ]\+\d{4}) 
    \]
    \z
    /x

Then

    m1 = str1.match(r)
    m1.named_captures
      #=> {"client_ip"=>"10.25.1.120",
      #    "lb_ip"=>nil,
      #    "user_ip"=>"10.25.1.120", 
      #    "time"=>"09/Dec/2019:16:22:23 +0000"}

    m2 = str2.match(r)
    m2.named_captures
      #=> {"client_ip"=>"10.25.1.120",
      #    "lb_ip"=>"165.225.70.200",
      #    "user_ip"=>"10.25.1.120",
      #    "time"=>"09/Dec/2019:13:30:57 +0000"} 

As seen, the contents of the capture group client_ip are set equal to the contents of user_ip. That is because each call \g<client_ip> re-captures into client_ip, so the group ends up holding the text matched by the last call. The reason for this behaviour is explained in the linked reference (look for "In PCRE but not Perl, one interesting twist is..." and other referenced sections of that document).
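The same effect can be reproduced with a much smaller pattern. The sketch below uses made-up group names (first, last) and a made-up input string; it contrasts calling a group that was defined earlier with calling one that is defined later.

```ruby
# Backward call: the group is defined first and called afterwards,
# so the call overwrites what the group captured originally.
BACKWARD = /\A(?<first>\d+),\g<first>\z/

# Forward call: the called group is defined last, so its own match
# is the final one recorded, and the calling group keeps its own text.
FORWARD = /\A(?<first>\g<last>),(?<last>\d+)\z/

puts BACKWARD.match("1,2")[:first].inspect   # "2"

fwd = FORWARD.match("1,2")
puts [fwd[:first], fwd[:last]].inspect       # ["1", "2"]
```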

The regular expression written conventionally

The regular expression is conventionally written as follows:

/\A(?<client_ip>\g<user_ip>)(?:, (?<lb_ip>\g<user_ip>))?(?:, (?<user_ip>\d{1,3}(?:\.\d{1,3}){3})) - - \[(?<time>\d{2}\/\p{L}{3}\/\d{4}:\d{2}:\d{2}:\d{2} \+\d{4})\]\z/

Notice that each space in the conventional form above corresponds to a character class containing a single space ([ ]) in the free-spacing version. That is necessary because in free-spacing mode unprotected spaces are removed before the expression is parsed. Another way to protect a space is to escape it (\ ). To match any whitespace character rather than only a space, \s can be used.
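A short sketch of how spaces behave with and without the x flag; the constant names are illustrative only.

```ruby
LITERAL     = /a b/     # normal mode: the space is matched literally
UNPROTECTED = /a b/x    # free-spacing: the space is dropped, so this is /ab/
BRACKETED   = /a[ ]b/x  # space protected by a character class
ESCAPED     = /a\ b/x   # space protected by a backslash
WHITESPACE  = /a\sb/x   # any whitespace character, e.g. a tab

puts LITERAL.match?("a b")       # true
puts UNPROTECTED.match?("a b")   # false
puts UNPROTECTED.match?("ab")    # true
puts BRACKETED.match?("a b")     # true
puts ESCAPED.match?("a b")       # true
puts WHITESPACE.match?("a\tb")   # true
```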
