
Parse Apache log file with Java regex

I was trying to parse an Apache log file, and it was going fine with the following pattern:

^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"[\\W]+

However, it breaks on the following log line:

218.30.103.62 - - [17/May/2015:11:05:11 +0000] "GET /robots.txt HTTP/1.1" 200 - "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"\

I'm not really experienced with regex and I'm working mostly by trial and error, so any help would be appreciated. (I know that the d+ is not supposed to be there, but that's pretty much all I know...)

Any idea? Thank you.

Your format is:

"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""

(see here)

So your regex will be:

"^(\\S+) (\\S+) (\\S+) \\[(.+?)\\] \\\"(.+?)\\\" (\\d{3}) (\\S+) \\\"(.+?)\\\" \\\"(.+?)\\\"[\\W]+ $"

where the matching groups are (I use the references as defined in the Apache docs):

  1. %h
  2. %l
  3. %u
  4. %t (without the enclosing [])
  5. %r
  6. %>s
  7. %b
  8. %{Referer}i
  9. %{User-agent}i
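
To make the group numbering concrete, here is a minimal sketch that applies a pattern of the same shape to the log line from the question and returns the nine groups. The class name ApacheLogGroups and the method parse are hypothetical, and the pattern ends in \\W*$ rather than the ending shown above. That is my own assumption, chosen so that zero or more stray non-word characters after the closing quote (such as the trailing backslash in the sample) are tolerated.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ApacheLogGroups {
    // Format: %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
    // Assumption: \\W*$ accepts any trailing non-word characters, or none.
    static final Pattern LOG_LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[(.+?)\\] \"(.+?)\" (\\d{3}) (\\S+) \"(.+?)\" \"(.+?)\"\\W*$");

    // Returns the nine captured fields in order, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            fields[i - 1] = m.group(i);
        }
        return fields;
    }

    public static void main(String[] args) {
        // The log line from the question, including its trailing backslash.
        String line = "218.30.103.62 - - [17/May/2015:11:05:11 +0000] "
            + "\"GET /robots.txt HTTP/1.1\" 200 - \"-\" "
            + "\"Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)\"\\";
        String[] fields = parse(line);
        if (fields == null) {
            System.out.println("no match");
            return;
        }
        for (int i = 0; i < fields.length; i++) {
            System.out.println((i + 1) + ": " + fields[i]);
        }
    }
}
```

Returning null for a non-matching line keeps the sketch short; real code might prefer an Optional or a small record type with named fields.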

Note: your regex is a bit overcomplicated, and the reason it fails is that %b is not always a number. When a request returns no bytes, the field is - instead of 0.
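
The effect of the %b field can be seen in isolation. The sketch below compares a digits-only group with a \\S+ group on just the status and size fields, then converts the captured value to a number. The class name BytesField and the helper parseBytes are hypothetical, and treating - as 0 is an assumption about what the caller wants, not something the log format requires.

```java
import java.util.regex.Pattern;

class BytesField {
    // Status and size fields only: the digits-only variant from the question
    // versus the \S+ variant from the answer.
    static final Pattern DIGITS_ONLY = Pattern.compile("(\\d{3}) (\\d+)");
    static final Pattern ANY_TOKEN = Pattern.compile("(\\d{3}) (\\S+)");

    // Hypothetical helper: maps the %b field to a byte count, treating "-" as 0.
    static long parseBytes(String field) {
        return "-".equals(field) ? 0L : Long.parseLong(field);
    }

    public static void main(String[] args) {
        String fragment = "200 -";
        System.out.println(DIGITS_ONLY.matcher(fragment).matches()); // false
        System.out.println(ANY_TOKEN.matcher(fragment).matches());   // true
        System.out.println(parseBytes("-"));    // 0
        System.out.println(parseBytes("2326")); // 2326
    }
}
```

If the distinction between "no bytes sent" and a literal zero matters, the helper could return an OptionalLong instead.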
