Parsing an Apache log file with Python

Question

I am writing a Python log parser script that needs to count the entries in a log file whose status code is 200.

Here are some of the entries from the file:

120.115.144.240 - - [29/Aug/2017:04:40:03 -0400] "GET /apng/assembler-2.0/assembler2.php HTTP/1.1" 404 231 "http://littlesvr.ca/apng/history.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36"

202.167.250.99 - - [29/Aug/2017:04:41:10 -0400] "GET /apng/images/o_sample.png?1424751982?1424776117 HTTP/1.1" 200 115656 "http://bbs.mydigit.cn/read.php?tid=2186780&fpage=3" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"

14.152.69.236 - - [29/Aug/2017:04:41:41 -0400] "GET /apng/images/o_sample.png?1424751982?1424776117 HTTP/1.1" 304 - "http://bbs.mydigit.cn/read.php?tid=2205351" "Mozilla/5.0 (Linux; U; Android 7.1.2; zh-CN; NX510J Build/NJH47D) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/40.0.2214.89 UCBrowser/11.6.6.951 Mobile Safari/537.36"

60.4.236.27 - - [29/Aug/2017:04:42:46 -0400] "GET /apng/images/o_sample.png?1424751982?1424776117 HTTP/1.1" 200 115656 "http://bbs.mydigit.cn/read.php?tid=1952896" "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"

58.62.17.190 - - [29/Aug/2017:04:50:01 -0400] "GET /apng/gif_apng_webp1.html HTTP/1.1" 200 935 "http://dev.qq.com/topic/582939577ef9c5b708556b0d" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

I have tried this code, but the only output I'm getting is a long list of empty brackets []:

#!/usr/bin/env python3

import sys
import re

f = open('accesslogfile', 'r')
print('Reading log files... done.')
nooflines = f.readlines()

for line in nooflines:    
    regex = re.match(r'\d{200}\s', line)
    print(regex)
f.close()

In this case, I know the output should be 3 (only three entries have the status code 200), but I can't seem to get it. Any help would be appreciated.

Thanks :)

Answer 1

Just change your regex to (200)\s. Your current pattern, \d{200}\s, matches 200 consecutive digits followed by one whitespace character (a space, a tab, or a line break). What you want is to match the token "200 ", so use (200)\s as your regex. Note that re.match only matches at the start of the line, so this pattern needs re.search to find the status code in the middle of a line.
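To make the difference between the quantifier and the literal concrete, here is a minimal sketch. The first line is one of the question's samples with the referer and user agent shortened to "-"; the second line is hypothetical and only illustrates that a bare 200\s can also hit the response-size field. The pattern anchored on the closing quote of the request is one possible way around that, not the answer's own suggestion.

```python
import re

# Sample line from the question, referer and user agent shortened to "-"
line_ok = ('58.62.17.190 - - [29/Aug/2017:04:50:01 -0400] '
           '"GET /apng/gif_apng_webp1.html HTTP/1.1" 200 935 "-" "-"')
# Hypothetical line: status 404, but a response size of 200 bytes
line_404 = ('10.0.0.1 - - [29/Aug/2017:04:51:00 -0400] '
            '"GET /missing HTTP/1.1" 404 200 "-" "-"')

quantifier = re.compile(r'\d{200}\s')  # 200 consecutive digits, then whitespace
literal = re.compile(r'200\s')         # the characters "200", then whitespace
anchored = re.compile(r'" 200 ')       # "200" right after the request's closing quote

def found(pattern, line):
    """True if the pattern occurs anywhere in the line."""
    return pattern.search(line) is not None

print(found(quantifier, line_ok))  # False
print(found(literal, line_ok))     # True
print(found(literal, line_404))    # True (false positive on the size field)
print(found(anchored, line_404))   # False
```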

Answer 2

You are doing the following things wrong here.

Using match instead of search. re.match only matches at the beginning of the string, while re.search scans the whole string for the first match.

Using {200} instead of {3}. \d{200} means 200 consecutive digits, not the number 200.

And not adding \s before the digits in the regex.

So your regex should be

re.search(r'\s\d{3}\s', line)
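The match/search distinction can be checked on a single line. This is a small sketch using one of the question's sample lines, with the referer and user agent shortened to "-" for readability.

```python
import re

# Sample line from the question, referer and user agent shortened to "-"
line = ('202.167.250.99 - - [29/Aug/2017:04:41:10 -0400] '
        '"GET /apng/images/o_sample.png?1424751982?1424776117 HTTP/1.1" '
        '200 115656 "-" "-"')

pattern = r'\s\d{3}\s'

at_start = re.match(pattern, line)   # only tries position 0, where the IP address begins
anywhere = re.search(pattern, line)  # scans forward until the pattern fits

print(at_start)                  # None
print(anywhere.group().strip())  # 200
```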

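So you have the following code:

import re

# Read the whole log; 'accesslogfile' is the file name used in the question
with open('accesslogfile') as f:
    log = f.read()

counter = 0
for line in log.split('\n'):
    # First three-digit token surrounded by whitespace (the status code in these samples)
    regex = re.search(r'\s\d{3}\s', line)
    # Skip blank or malformed lines, where search returns None
    if regex and regex.group().strip() == '200':
        counter += 1
print('Found', counter)

Output (for the five sample lines in the question):

Found 3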
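A more defensive variant pins each field by position instead of searching for the first three-digit token. The sketch below assumes the file follows the Apache "combined" layout seen in the question's samples; LOG_PATTERN and count_statuses are illustrative names, and the sample lines have their referer and user agent shortened to "-".

```python
import re
from collections import Counter

# Assumed layout: host, two identity fields, [timestamp], "request", status, size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def count_statuses(lines):
    """Tally status codes, skipping lines that do not fit the pattern."""
    counts = Counter()
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            counts[m.group('status')] += 1
    return counts

# The question's five entries, referer and user agent shortened to "-"
sample = [
    '120.115.144.240 - - [29/Aug/2017:04:40:03 -0400] "GET /apng/assembler-2.0/assembler2.php HTTP/1.1" 404 231 "-" "-"',
    '202.167.250.99 - - [29/Aug/2017:04:41:10 -0400] "GET /apng/images/o_sample.png?1424751982?1424776117 HTTP/1.1" 200 115656 "-" "-"',
    '14.152.69.236 - - [29/Aug/2017:04:41:41 -0400] "GET /apng/images/o_sample.png?1424751982?1424776117 HTTP/1.1" 304 - "-" "-"',
    '60.4.236.27 - - [29/Aug/2017:04:42:46 -0400] "GET /apng/images/o_sample.png?1424751982?1424776117 HTTP/1.1" 200 115656 "-" "-"',
    '58.62.17.190 - - [29/Aug/2017:04:50:01 -0400] "GET /apng/gif_apng_webp1.html HTTP/1.1" 200 935 "-" "-"',
]

counts = count_statuses(sample)
print(counts['200'])  # 3
```

Because the pattern is anchored at the start of the line, re.match is the appropriate call here, and a Counter gives the totals for every status code in one pass.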

Answer 3

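import pandas

# "log_path" is a placeholder for the path to the access log.
# Splitting on whitespace (quoted fields stay whole) yields 10 columns
# for the sample lines; column 6 holds the status code.
df = pandas.read_csv("log_path", sep=r'\s+', names=list(range(10)))

ok = df.loc[df[6] == 200]
print(ok)       # the matching rows
print(len(ok))  # how many there are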
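If pandas is not available, the same idea (split on spaces, keep quoted fields whole, read column 6) can be sketched with the standard csv module. This is an alternative under the same assumption about the line layout, not part of the answer; count_status is an illustrative name, and the two sample lines have their referer and user agent shortened to "-".

```python
import csv
import io

# Two sample lines from the question, referer and user agent shortened to "-"
sample = (
    '120.115.144.240 - - [29/Aug/2017:04:40:03 -0400] '
    '"GET /apng/assembler-2.0/assembler2.php HTTP/1.1" 404 231 "-" "-"\n'
    '58.62.17.190 - - [29/Aug/2017:04:50:01 -0400] '
    '"GET /apng/gif_apng_webp1.html HTTP/1.1" 200 935 "-" "-"\n'
)

def count_status(stream, wanted='200'):
    """Count rows whose seventh space-separated field equals `wanted`."""
    reader = csv.reader(stream, delimiter=' ', quotechar='"')
    # The timestamp splits into two fields, so the status lands at index 6
    return sum(1 for row in reader if len(row) > 6 and row[6] == wanted)

print(count_status(io.StringIO(sample)))  # 1
```

For a real file, pass an open file object instead of the StringIO, for example one opened with open(path, newline='').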

Answer 4

It's simple:

re.findall(r'(HTTP/1\.1" 200)', line)
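Since findall returns every non-overlapping match, it can also be run once over the whole file contents and the matches counted. A minimal sketch, assuming the log has been read into one string; log_text here holds three shortened sample entries, and the pattern is loosened slightly to accept any HTTP version, which is an addition to the answer's pattern.

```python
import re

# Three shortened entries standing in for the contents of the log file
log_text = (
    '120.115.144.240 - - [29/Aug/2017:04:40:03 -0400] "GET /apng/assembler-2.0/assembler2.php HTTP/1.1" 404 231 "-" "-"\n'
    '60.4.236.27 - - [29/Aug/2017:04:42:46 -0400] "GET /apng/images/o_sample.png HTTP/1.1" 200 115656 "-" "-"\n'
    '58.62.17.190 - - [29/Aug/2017:04:50:01 -0400] "GET /apng/gif_apng_webp1.html HTTP/1.1" 200 935 "-" "-"\n'
)

# Protocol version, closing quote of the request, then the status code 200
hits = re.findall(r'HTTP/\d\.\d" 200 ', log_text)
print(len(hits))  # 2
```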

4 answers

Answer 1
0 2017-11-27 01:48:18

Answer 2
0 2017-11-27 01:48:29

Answer 3
0 2017-11-27 01:59:11

Answer 4
0 2017-11-27 02:10:33