Parsing a URL using re.match().groups() in Python
Apologies in advance if this question seems quite basic.
Given:
A line from an Apache HTTP access log file, as follows:
sample_apache_access_log_line = '- - [01/Feb/2017:00:00:00 +0200] "GET /aikakausi/binding/1145113/image/14 HTTP/1.1" 200 658925 "http://digi.kansalliskirjasto.fi/aikakausi/binding/1145113?page=14&term=HOIKKA" "Mozilla/5.0 (Linux; Android 5.1.1; SM-J320FN Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/49.0.2623.105 Mobile Safari/537.36 [FB_IAB/MESSENGER;FBAV/100.0.0.29.61;]" 569'
Goal:
I extract information with the following pattern:
import re

CUSTOM_LOG_PATTERN = r'- - \[(.*?)\] "(.*)" (\d{3}) (.*) "([^"]+)" "(.*?)" (.*)'
matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
print(matched_line)
l = matched_line.groups()  # WORKS OK: 7 groups, timestamp first
I then collect all the fields in a list for further processing:
cleaned_lines = []
cleaned_lines.append({
    "timestamp": l[0],
    "client_request_line": l[1],
    "status": l[2],
    "bytes_sent": l[3],
    "referer": l[4],
    "user_agent": l[5],
    "session_id": l[6],
})
Problem:
Some lines have a broken URL in the referer field (starting with http://192.168.8.1/), similar to:
sample_apache_access_log_line = '- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 "http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995'
I would like to use a regex so that the extracted URL always starts with http://LETTERS, which is why I changed my code to:
CUSTOM_LOG_PATTERN = '- - \[.*?\] "(http://[a-zA-Z].*)" (\d{3}) (.*) "([^\"]+)" "(.*?)" (.*)'
<<<<<PROBLEM>>>>>
matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
print (matched_line)
l = matched_line.groups() # ERROR
print(l)
But then I get this error:
AttributeError Traceback (most recent call last)
<ipython-input-88-c7a93cfbce61> in <module>
4 matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
5 print (matched_line)
----> 6 l = matched_line.groups()
7 print(l)
AttributeError: 'NoneType' object has no attribute 'groups'
Is there anything I'm doing wrong with re.match().groups()?
Use re.findall(), then re.split().
import re

pattern = r'(http://\D.*)'  # 'http://' followed by a non-digit, then the rest of the line
matches = re.findall(pattern, log_file_string)  # findall() returns a list of strings
url = re.split(r"\s", matches[0])  # split the first hit on whitespace
url = url[0]  # the URL alone, possibly with a trailing quote
You may need to use str.strip() to remove any remaining special characters enclosing the URL, such as a trailing double quote.
If you must use re.match(), try simplifying the pattern.
pattern = r'(.*?)(http://\D.*)"(.*)'  # no leading quote, so a URL after 'address=' also matches
url_start = re.match(pattern, log_file_string)
url_string = url_start.group(2)
url = re.split(r"\s", url_string)
url = url[0].strip('"')
Match.groups() returns a tuple of all captured groups, whereas the snippet above used Match.group() to fetch a single group. To use groups() instead, try:
pattern = r'(.*?)(http://\D[^"]*)"'  # group 2 stops at the closing quote
url_start = re.match(pattern, log_file_string)
url = url_start.groups()  # tuple of all captured groups
url = url[1]  # the second group is the URL
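On the error itself: re.match() returns None when the pattern does not match, so calling .groups() on the result raises the AttributeError shown in the question. In the modified pattern, http://[a-zA-Z] was added to the first quoted group, which holds the request line (GET ...) rather than the referer, so no line can match. Below is a minimal sketch of one way around this, assuming the goal is to keep the full-line pattern and replace the referer with the first embedded URL whose host starts with a letter. The names parse_line, LOG_PATTERN and LETTER_URL are hypothetical and not from the original post.

```python
import re

# Same structure as the question's pattern, with the timestamp captured too.
LOG_PATTERN = re.compile(
    r'- - \[(.*?)\] "(.*)" (\d{3}) (.*) "([^"]+)" "(.*?)" (.*)'
)

# Hypothetical helper pattern: an http:// URL whose host starts with a letter.
LETTER_URL = re.compile(r'http://[a-zA-Z][^"\s]*')


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    matched = LOG_PATTERN.match(line)
    if matched is None:  # re.match() returns None when nothing matches
        return None
    timestamp, request, status, size, referer, agent, session = matched.groups()
    inner = LETTER_URL.search(referer)
    if inner is not None:
        referer = inner.group(0)  # e.g. the URL that follows "address="
    return {
        "timestamp": timestamp,
        "client_request_line": request,
        "status": status,
        "bytes_sent": size,
        "referer": referer,
        "user_agent": agent,
        "session_id": session,
    }


# Usage with the broken sample line from the question.
sample = '- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 "http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995'
record = parse_line(sample)
print(record["referer"])
# http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55
```

Checking for None before calling .groups() also lets the caller skip or log malformed lines instead of crashing on them.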
If the URL is all you need, you can just use split():
sample_apache_access_log_line = [
'- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 "http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995',
'- - [01/Feb/2017:12:34:53 +0200] "GET /aikakausi/binding/641892?term=PETSAMON&term=Petsamon&page=6 HTTP/1.1" 200 3162 "http://digi.kansalliskirjasto.fi/aikakausi/search?query=petsamo&requireAllKeywords=true&fuzzy=false&hasIllustrations=false&startDate=1918-12-30&endDate=1920-12-30&orderBy=RELEVANCE&pages=&page=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0" 418'
]
for i in sample_apache_access_log_line:
    if 'address=' in i:
        print(i.split('"')[3].split('address=')[1])
    else:
        print(i.split('"')[3])
# http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55
# http://digi.kansalliskirjasto.fi/aikakausi/search?query=petsamo&requireAllKeywords=true&fuzzy=false&hasIllustrations=false&startDate=1918-12-30&endDate=1920-12-30&orderBy=RELEVANCE&pages=&page=5
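If "starts with letters" really means "the referer host is not a bare IP address", that check can be sketched with the standard library instead of a regex. This assumes the wrapped URL always follows an address= marker, as in the sample line; other log lines may differ. The helpers host_is_ip and unwrap_referer are hypothetical names used only for illustration.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(url):
    """Return True if the URL's host is a literal IP address."""
    try:
        host = urlsplit(url).hostname
        if host is None:  # e.g. a "-" referer has no host at all
            return False
        ipaddress.ip_address(host)  # raises ValueError for names like digi.example
    except ValueError:
        return False
    return True


def unwrap_referer(referer, marker="address="):
    """Return the URL after the marker when the outer host is an IP address."""
    if host_is_ip(referer) and marker in referer:
        return referer.split(marker, 1)[1]
    return referer


broken = "http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55"
print(unwrap_referer(broken))
# http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55
```

Referers whose host is a normal domain name are returned unchanged, so the function can be applied to every line without first testing for 'address='.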