Parsing a URL using re.match().groups() in Python
Apologies in advance if this question seems basic.
Given:
An Apache HTTP access log file with lines like this one:
sample_apache_access_log_line = '- - [01/Feb/2017:00:00:00 +0200] "GET /aikakausi/binding/1145113/image/14 HTTP/1.1" 200 658925 "http://digi.kansalliskirjasto.fi/aikakausi/binding/1145113?page=14&term=HOIKKA" "Mozilla/5.0 (Linux; Android 5.1.1; SM-J320FN Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/49.0.2623.105 Mobile Safari/537.36 [FB_IAB/MESSENGER;FBAV/100.0.0.29.61;]" 569'
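Goal:
I extract the information using the following pattern:
import re

# Seven capture groups: timestamp, request line, status, bytes sent, referer, user agent, session id
ACCESS_LOG_PATTERN = r'- - \[(.*?)\] "(.*)" (\d{3}) (.*) "([^\"]+)" "(.*?)" (.*)'
matched_line = re.match(ACCESS_LOG_PATTERN, sample_apache_access_log_line)
print(matched_line)
l = matched_line.groups() # WORKS OK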
Then I dump all the information into a list for further processing:
cleaned_lines = []
cleaned_lines.append({
    "timestamp": l[0],
    "client_request_line": l[1],
    "status": l[2],
    "bytes_sent": l[3],
    "referer": l[4],
    "user_agent": l[5],
    "session_id": l[6],
})
Problem:
Sometimes there are lines with a corrupted URL (referer) that starts with http://192.168.8.1/, such as:
sample_apache_access_log_line = '- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 "http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995'
I want to handle them with a regular expression that says the URL always starts with http://LETTERS, which is why I changed my code to:
CUSTOM_LOG_PATTERN = r'- - \[.*?\] "(http://[a-zA-Z].*)" (\d{3}) (.*) "([^\"]+)" "(.*?)" (.*)'  # <<<<< PROBLEM
matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
print(matched_line)
l = matched_line.groups() # ERROR
print(l)
But then I got this error:
AttributeError Traceback (most recent call last)
<ipython-input-88-c7a93cfbce61> in <module>
4 matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
5 print (matched_line)
----> 6 l = matched_line.groups()
7 print(l)
AttributeError: 'NoneType' object has no attribute 'groups'
Am I doing something wrong with re.match().groups()?
Use re.findall() and then re.split().
import re

# log_file_string is assumed to hold one line of the access log
pattern = r'(http://\D.*)'  # 'http://' followed by a non-digit, then the rest of the line
url_start = re.findall(pattern, log_file_string)  # list of matches, each starting at a URL
url = re.split(r"\s", url_start[0])  # split the first match on whitespace to isolate the URL
url = url[0]
You may need str.strip() to remove any special characters left around the URL.
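As a minimal sketch of the three steps together (findall, split, strip), the snippet below runs them on the corrupted line from the question. The user agent is shortened here for readability, and the intermediate names matches and first_piece are only illustrative.

```python
import re

# Corrupted line from the question, with the user agent shortened for readability.
log_file_string = (
    '- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 '
    '"http://192.168.8.1/html/home.html?url&address='
    'http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" '
    '"Mozilla/5.0 (Windows NT 6.3; WOW64)" 1995'
)

# 'http://192...' is skipped because \D rejects the digit; the match starts at 'http://digi...'
matches = re.findall(r'(http://\D.*)', log_file_string)

# The first whitespace-separated piece still carries the referer's closing quote.
first_piece = re.split(r'\s', matches[0])[0]

# strip() with an argument trims those characters from both ends only.
url = first_piece.strip('"')
print(url)
```

Because strip() never touches characters in the middle of the string, the ? and = inside the URL are left alone.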
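If you must use re.match(), try simplifying the pattern.
import re

# Requires a double quote directly before 'http://' plus a non-digit
pattern = r'(.*)"(http://\D.*)"(.*)'
url_start = re.match(pattern, log_file_string)  # None when the line does not match
if url_start is not None:
    url_string = url_start.group(2)
    url = re.split(r"\s", url_string)  # the URL is the first whitespace-separated piece
    url = url[0].strip('"')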
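re.match() only matches at the start of the string, which is why the pattern needs a leading (.*). A possible alternative, sketched here under the assumption that only the first URL with a non-digit host is wanted, is re.search(), which scans the whole line. The names NAMED_HOST_URL and first_named_host_url are hypothetical.

```python
import re

# 'http://' followed by a non-digit, then everything up to the next quote or whitespace.
NAMED_HOST_URL = re.compile(r'http://\D[^"\s]*')

def first_named_host_url(line):
    """Return the first URL whose host does not start with a digit, or None."""
    m = NAMED_HOST_URL.search(line)
    return m.group(0) if m else None

# Referer field of the corrupted line from the question, wrapped in its quotes.
corrupted = ('"http://192.168.8.1/html/home.html?url&address='
             'http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55"')
print(first_named_host_url(corrupted))
```

Since the character class stops at a quote or whitespace, no follow-up split() or strip() is needed in this variant.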
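Match.groups() returns a tuple of all captured groups, whereas the re.match() snippet above uses Match.group() to fetch a single one. To use groups() instead, try:
pattern = r'(.*)"(http://\D[^"]*)"'  # [^"]* stops the second group at the closing quote
url_start = re.match(pattern, log_file_string)  # None when the line does not match
if url_start is not None:
    url = url_start.groups()  # tuple of all captured groups
    url = url[1]  # second group: the URL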
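On the original error: in the question's CUSTOM_LOG_PATTERN, the http://[a-zA-Z] constraint sits in the first quoted group, which lines up with the request line (GET /...) rather than the referer, so re.match() returns None and .groups() raises AttributeError. One way to handle this, sketched below, is to keep the full-line pattern permissive, check for None, and apply the "starts with letters" rule to the referer afterwards. LOG_PATTERN, NAMED_HOST_URL and parse_line are hypothetical names, and the pattern is an adaptation of the question's, not a general Apache log parser.

```python
import re

# Full-line pattern with named groups, adapted from the question's pattern.
LOG_PATTERN = re.compile(
    r'- - \[(?P<timestamp>.*?)\] "(?P<client_request_line>.*?)" '
    r'(?P<status>\d{3}) (?P<bytes_sent>\S+) '
    r'"(?P<referer>[^"]+)" "(?P<user_agent>.*)" (?P<session_id>\S+)'
)
# Applied to the referer only: a URL whose host starts with a letter.
NAMED_HOST_URL = re.compile(r'http://[a-zA-Z][^"\s]*')

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:  # guard: calling .groups() on None is what raised AttributeError
        return None
    record = m.groupdict()
    inner = NAMED_HOST_URL.search(record["referer"])
    if inner is not None:
        record["referer"] = inner.group(0)
    return record

# Corrupted line from the question.
line = ('- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" '
        '401 1612 "http://192.168.8.1/html/home.html?url&address='
        'http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" '
        '"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/55.0.2883.87 Safari/537.36" 1995')
record = parse_line(line)
print(record["referer"])
```

Returning None for unmatched lines lets the caller skip or log them instead of crashing halfway through the file.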
If the URL is all you need, you can use split():
sample_apache_access_log_line = [
'- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 "http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995',
'- - [01/Feb/2017:12:34:53 +0200] "GET /aikakausi/binding/641892?term=PETSAMON&term=Petsamon&page=6 HTTP/1.1" 200 3162 "http://digi.kansalliskirjasto.fi/aikakausi/search?query=petsamo&requireAllKeywords=true&fuzzy=false&hasIllustrations=false&startDate=1918-12-30&endDate=1920-12-30&orderBy=RELEVANCE&pages=&page=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0" 418'
]
for i in sample_apache_access_log_line:
    if 'address=' in i:
        print(i.split('"')[3].split('address=')[1])
    else:
        print(i.split('"')[3])
# http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55
# http://digi.kansalliskirjasto.fi/aikakausi/search?query=petsamo&requireAllKeywords=true&fuzzy=false&hasIllustrations=false&startDate=1918-12-30&endDate=1920-12-30&orderBy=RELEVANCE&pages=&page=5
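The address= lookup can also be done with the standard library's URL parser instead of string splitting. This is only a sketch: unwrap_referer is a hypothetical helper, and it assumes the wrapped URL contains no & of its own, because parse_qs() would cut the value at the first one. parse_qs() also decodes percent-escapes in the value.

```python
from urllib.parse import urlsplit, parse_qs

def unwrap_referer(referer):
    """Return the URL held in the 'address' query parameter, else the referer unchanged."""
    query = urlsplit(referer).query      # everything after the first '?'
    address = parse_qs(query).get("address")
    return address[0] if address else referer

wrapped = ('http://192.168.8.1/html/home.html?url&address='
           'http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55')
print(unwrap_referer(wrapped))
```

For referers whose wrapped URL may carry several query parameters, the split('address=') approach above keeps the whole remainder and is the safer choice.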