Parsing a URL using re.match().groups() in Python
Apologies in advance if this question seems basic.
Given:
An Apache HTTP access log file with lines like the following:
sample_apache_access_log_line = '- - [01/Feb/2017:00:00:00 +0200] "GET /aikakausi/binding/1145113/image/14 HTTP/1.1" 200 658925 "http://digi.kansalliskirjasto.fi/aikakausi/binding/1145113?page=14&term=HOIKKA" "Mozilla/5.0 (Linux; Android 5.1.1; SM-J320FN Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/49.0.2623.105 Mobile Safari/537.36 [FB_IAB/MESSENGER;FBAV/100.0.0.29.61;]" 569'
Goal:
I use the following pattern to extract the information:
import re

# The timestamp is captured too, so groups() yields the seven fields used below
CUSTOM_LOG_PATTERN = r'- - \[(.*?)\] "(.*)" (\d{3}) (.*) "([^\"]+)" "(.*?)" (.*)'
matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
print(matched_line)
l = matched_line.groups() # WORKS OK
Then I dump all the information into a list for further processing:
cleaned_lines = []
cleaned_lines.append({
    "timestamp": l[0],
    "client_request_line": l[1],
    "status": l[2],
    "bytes_sent": l[3],
    "referer": l[4],
    "user_agent": l[5],
    "session_id": l[6],
})
Problem:
Sometimes there are lines with a broken URL (referer) that starts with http://192.168.8.1/, such as:
sample_apache_access_log_line = '- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 "http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995'
I want to handle them with a regular expression that says the URL always starts with http://LETTERS, which is why I changed my code to:
CUSTOM_LOG_PATTERN = '- - \[.*?\] "(http://[a-zA-Z].*)" (\d{3}) (.*) "([^\"]+)" "(.*?)" (.*)'
# <<<<< PROBLEM >>>>>
matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
print (matched_line)
l = matched_line.groups() # ERROR
print(l)
But then I get this error:
AttributeError Traceback (most recent call last)
<ipython-input-88-c7a93cfbce61> in <module>
4 matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
5 print (matched_line)
----> 6 l = matched_line.groups()
7 print(l)
AttributeError: 'NoneType' object has no attribute 'groups'
Am I doing something wrong with re.match().groups()?
Use re.findall() and then re.split().
import re

pattern = r'(http://\D.*)'  # 'http://' followed by a non-digit, then the rest of the line
url_start = re.findall(pattern, log_file_string)  # list of matching substrings
url = re.split(r"\s", url_start[0])  # split the first match on whitespace
url = url[0]  # the URL is the first token
You may need str.strip() to remove any special characters left around the URL, such as a trailing double quote.
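As a minimal sketch of the findall-then-strip idea, the steps can be wrapped in one function. The name extract_first_url is hypothetical, and the pattern assumes the wanted URL is the first one in the line that starts with http:// followed by a non-digit and that it contains no whitespace.

```python
import re

def extract_first_url(line):
    """Return the first 'http://<non-digit>...' token in line, or None."""
    # \S* stops at the first whitespace, so no separate re.split() is needed
    found = re.findall(r'http://\D\S*', line)
    if not found:
        return None
    # The token may still carry the closing quote of the log field
    return found[0].strip('"')
```

Returning None for lines without such a URL lets the caller skip them instead of hitting an IndexError.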
If you must use re.match(), try simplifying the pattern.
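pattern = r'(.*)"(http://\D.*)"(.*)'  # group 2 starts right after a quote, at http:// plus a non-digit
url_start = re.match(pattern, log_file_string)
url_string = url_start.group(2)  # greedy, so it may run past the URL's closing quote
url = re.split(r"\s", url_string)  # the URL is the first whitespace-separated token
url = url[0].strip('"')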
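re.match() returns None when the pattern does not match from the start of the string, and calling .group() or .groups() on None raises the AttributeError shown in the question. The sketch below guards against that. URL_PATTERN and referer_url are hypothetical names, and the pattern is an assumption: it picks the last http:// URL in the line that begins with a non-digit and ends at a double quote.

```python
import re

# Assumed pattern: skip anything, then capture 'http://' + non-digit
# up to (but not including) the next double quote.
URL_PATTERN = re.compile(r'.*(http://\D[^"]*)"')

def referer_url(line):
    """Return the captured URL, or None if the line does not match."""
    match = URL_PATTERN.match(line)
    if match is None:
        # No match: report it instead of failing on match.group()
        return None
    return match.group(1)
```

Because the leading .* is greedy, a user agent that itself contains such a URL would be picked up instead of the referer, so this is only suitable when that cannot happen.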
Match.groups() returns a tuple of all captured groups, whereas the example above uses Match.group(). Try:
pattern = r'(.*)"(http://\D.*)"[^\"]'
url_start = re.match(pattern, log_file_string)
url = url_start.groups()  # tuple of all captured groups
url = url[1]  # second group, same as url_start.group(2)
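The difference between the two methods is easier to see on a small input. This is an illustrative sketch only; the string 'page=55' and the pattern are made up for the demonstration and are not part of the log format.

```python
import re

m = re.match(r'(\w+)=(\d+)', 'page=55')
if m is not None:
    whole = m.group(0)      # the entire match: 'page=55'
    second = m.group(2)     # one group by number: '55'
    everything = m.groups() # all groups as a tuple: ('page', '55')
    print(whole, second, everything)
```

Note that the argument of groups() is a default value for groups that did not take part in the match, not a group number, so indexing the returned tuple is how a single group is picked from it.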
If the URL is all you need, you can use split():
sample_apache_access_log_line = [
'- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 "http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995',
'- - [01/Feb/2017:12:34:53 +0200] "GET /aikakausi/binding/641892?term=PETSAMON&term=Petsamon&page=6 HTTP/1.1" 200 3162 "http://digi.kansalliskirjasto.fi/aikakausi/search?query=petsamo&requireAllKeywords=true&fuzzy=false&hasIllustrations=false&startDate=1918-12-30&endDate=1920-12-30&orderBy=RELEVANCE&pages=&page=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0" 418'
]
for i in sample_apache_access_log_line:
    # Splitting on '"' puts the referer in field 3
    if 'address=' in i:
        print(i.split('"')[3].split('address=')[1])
    else:
        print(i.split('"')[3])
# http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55
# http://digi.kansalliskirjasto.fi/aikakausi/search?query=petsamo&requireAllKeywords=true&fuzzy=false&hasIllustrations=false&startDate=1918-12-30&endDate=1920-12-30&orderBy=RELEVANCE&pages=&page=5
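The same split-based idea can be packaged as a function that returns the value instead of printing it. This is a sketch under the assumption that every line follows the layout of the samples above, with the referer as the second double-quoted field; referer_from_line is a hypothetical name.

```python
def referer_from_line(line):
    """Return the referer URL, unwrapping an embedded 'address=' URL if present."""
    fields = line.split('"')
    if len(fields) < 4:
        # Not enough quoted fields: the line does not have the expected layout
        return None
    referer = fields[3]
    # partition() returns (before, separator, after); separator is '' if absent
    _, sep, embedded = referer.partition('address=')
    return embedded if sep else referer
```

Using partition() instead of a second split() keeps everything after the first 'address=' intact, including any further query parameters of the embedded URL.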