Parsing a wget log file in Python

Question

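I have a wget log file and want to parse it so that I can extract the relevant information for each log entry, e.g. IP address, timestamp, URL, etc.

A sample log file is below. The number of lines and the level of detail differ from entry to entry. What is consistent is the notation used on each line.

I am able to extract individual lines, but I would like a multidimensional array (or something similar):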

import re

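# Read the whole log file into a single string
with open('c:/r1/log.txt', 'r') as f:
    log_text = f.read()

# Match every request line, e.g. "--2014-11-22 10:51:31--  http://..."
split_log = re.findall('--[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}.*', log_text)

print(split_log)
print(len(split_log))

for element in split_log:
    print(element)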


####### Start log file example

2014-11-22 10:51:31 (96.9 KB/s) - `C:/r1/www.itb.ie/AboutITB/index.html' saved [13302]

--2014-11-22 10:51:31--  http://www.itb.ie/CurrentStudents/index.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/CurrentStudents/index.html'

     0K .......... .......                                      109K=0.2s

2014-11-22 10:51:31 (109 KB/s) - `C:/r1/www.itb.ie/CurrentStudents/index.html' saved [17429]

--2014-11-22 10:51:32--  h ttp://www.itb.ie/Vacancies/index.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Vacancies/index.html'

     0K .......... .......... ..                                118K=0.2s

2014-11-22 10:51:32 (118 KB/s) - `C:/r1/www.itb.ie/Vacancies/index.html' saved [23010]

--2014-11-22 10:51:32--  h ttp://www.itb.ie/Location/howtogetthere.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Location/howtogetthere.html'

     0K .......... .......                                      111K=0.2s

Answer 1

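Here is a way to extract the data you want and store it in a list of tuples.

The regexes I use here are not perfect, but they work correctly with your sample data. I modified your original regex to use the more readable \d instead of the equivalent [0-9]. I also used raw strings, which generally makes working with regexes easier.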
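To illustrate those two changes in isolation, here is a minimal sketch. The sample string `line` is copied from the log, and the names `pat_classes` and `pat_digits` are made up for this example. One caveat: in Python 3, `\d` in a `str` pattern also matches non-ASCII Unicode digits unless `re.ASCII` is passed, so the two forms are interchangeable only for plain ASCII input such as this log.

```python
import re

line = '--2014-11-22 10:51:31--  http://www.itb.ie/CurrentStudents/index.html'

# Original style: explicit character classes in an ordinary string
pat_classes = re.compile('--[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}')

# Same pattern with \d, written as a raw string so the backslashes
# reach the regex engine without Python interpreting them first
pat_digits = re.compile(r'--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')

match_classes = pat_classes.match(line)
match_digits = pat_digits.match(line)

# Both patterns match the same leading text of the line
print(match_classes.group(0) == match_digits.group(0))
```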

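I have embedded your log data in my code as a triple-quoted string so that I don't have to worry about file handling. I noticed that some of the URLs in your log file contain a space, e.g.

h ttp://www.itb.ie/Vacancies/index.html

but I assume those spaces are an artifact of copying and pasting and are not present in the real log data. If that is not the case, your program will need to do some extra work to cope with those extraneous spaces.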
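If the spaces do turn out to be in the real log, one possible fix is to strip whitespace from each captured URL before storing it. This is only a sketch: it assumes URLs in the log never legitimately contain spaces, and the helper name `clean_url` is hypothetical.

```python
import re

def clean_url(raw_url):
    """Return raw_url with every whitespace character removed."""
    return re.sub(r'\s+', '', raw_url)

print(clean_url('h ttp://www.itb.ie/Vacancies/index.html'))
```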

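I also changed the IP addresses in the log data so that they are not all identical, just to make sure that each IP found by findall is associated with the correct timestamp and URL.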

#! /usr/bin/env python

import re

log_lines = '''

2014-11-22 10:51:31 (96.9 KB/s) - `C:/r1/www.itb.ie/AboutITB/index.html' saved [13302]

--2014-11-22 10:51:31--  http://www.itb.ie/CurrentStudents/index.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/CurrentStudents/index.html'

     0K .......... .......                                      109K=0.2s

2014-11-22 10:51:31 (109 KB/s) - `C:/r1/www.itb.ie/CurrentStudents/index.html' saved [17429]

--2014-11-22 10:51:32--  http://www.itb.ie/Vacancies/index.html
Connecting to www.itb.ie|193.1.36.25|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Vacancies/index.html'

     0K .......... .......... ..                                118K=0.2s

2014-11-22 10:51:32 (118 KB/s) - `C:/r1/www.itb.ie/Vacancies/index.html' saved [23010]

--2014-11-22 10:51:32--  http://www.itb.ie/Location/howtogetthere.html
Connecting to www.itb.ie|193.1.36.26|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Location/howtogetthere.html'

     0K .......... .......                                      111K=0.2s
'''

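# Capture the timestamp and the URL from each "--<timestamp>--  <url>" line
time_and_url_pat = re.compile(r'--(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})--\s+(.*)')
# Capture the IP address between the pipes on each "Connecting to" line
ip_pat = re.compile(r'Connecting to.*\|(.*?)\|')

time_and_url_list = time_and_url_pat.findall(log_lines)
print('\ntime and url')
print(time_and_url_list)

ip_list = ip_pat.findall(log_lines)
print('\nip')
print(ip_list)

# Pair each (time, url) tuple with the IP found at the same position
all_data = [(t, u, i) for (t, u), i in zip(time_and_url_list, ip_list)]
print('\nall')
print(all_data, '\n')

for t in all_data:
    print(t)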

Output

time and url
[('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html')]

ip
['193.1.36.24', '193.1.36.25', '193.1.36.26']

all
[('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html', '193.1.36.24'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html', '193.1.36.25'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html', '193.1.36.26')] 

('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html', '193.1.36.24')
('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html', '193.1.36.25')
('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html', '193.1.36.26')

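The last part of the code uses a list comprehension to reorganise the data from time_and_url_list and ip_list into a single list of tuples, using the zip built-in function to process the two lists in parallel. If that part is hard to follow, please let me know and I'll try to explain further.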
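In case it helps, the zip step can be seen on its own with two short lists. This is a small sketch; the names `pairs`, `ips` and `combined` are made up for the illustration. Note that zip stops at the end of the shorter list and pairs items purely by position, so a log entry with no `Connecting to` line would probably shift later IPs onto the wrong entries.

```python
pairs = [
    ('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html'),
    ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html'),
]
ips = ['193.1.36.24', '193.1.36.25']

# zip yields ((time, url), ip) items; the comprehension unpacks each one
# and flattens it into a single (time, url, ip) tuple
combined = [(t, u, i) for (t, u), i in zip(pairs, ips)]

for row in combined:
    print(row)
```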

Answer 1 (accepted, 2014-11-22 13:02:59)
