Parsing a URL with re.match().groups() in Python

Question

Apologies in advance if this question seems quite basic.

Given:

An Apache HTTP access log file with lines like the following:

sample_apache_access_log_line = '- - [01/Feb/2017:00:00:00 +0200] "GET /aikakausi/binding/1145113/image/14 HTTP/1.1" 200 658925 "http://digi.kansalliskirjasto.fi/aikakausi/binding/1145113?page=14&term=HOIKKA" "Mozilla/5.0 (Linux; Android 5.1.1; SM-J320FN Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/49.0.2623.105 Mobile Safari/537.36 [FB_IAB/MESSENGER;FBAV/100.0.0.29.61;]" 569'

Goal:

I extract information with the following pattern:

import re

# Seven groups: timestamp, request line, status, bytes sent, referer, user agent, session id
CUSTOM_LOG_PATTERN = r'- - \[(.*?)\] "(.*)" (\d{3}) (.*) "([^\"]+)" "(.*?)" (.*)'
matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
print(matched_line)
l = matched_line.groups()  # WORKS OK

I then dump all the fields into a list for further processing:

cleaned_lines = []
cleaned_lines.append({
                "timestamp":            l[0],
                "client_request_line":  l[1],
                "status":               l[2],
                "bytes_sent":           l[3],
                "referer":              l[4],
                "user_agent":           l[5],
                "session_id":           l[6],
})

Problem:

Some lines have a broken URL in the referer field (starting with http://192.168.8.1/), for example:

sample_apache_access_log_line = '- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 "http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995'

I would like to use a regex to require that the URL always starts with http://LETTERS, so I changed my code to:

CUSTOM_LOG_PATTERN = r'- - \[.*?\] "(http://[a-zA-Z].*)" (\d{3}) (.*) "([^\"]+)" "(.*?)" (.*)'
#                                    <<<<<PROBLEM>>>>>
matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
print (matched_line)
l = matched_line.groups() # ERROR
print(l)

But then I get this error:

AttributeError                            Traceback (most recent call last)

<ipython-input-88-c7a93cfbce61> in <module>
      4 matched_line = re.match(CUSTOM_LOG_PATTERN, sample_apache_access_log_line)
      5 print (matched_line)
----> 6 l = matched_line.groups()
      7 print(l)

AttributeError: 'NoneType' object has no attribute 'groups'

Am I doing something wrong with re.match().groups()?

Answer 1

Use re.findall(), then re.split().

import re

pattern = r'(http://\D.*)'                            # 'http://' followed by a non-digit
url_start = re.findall(pattern, log_file_string)[0]   # findall returns a list; take the first match
url = re.split(r"\s", url_start)                      # split on whitespace to isolate the url

url = url[0]

You may need to use str.strip() to remove any remaining special characters enclosing the url, such as a trailing double quote.
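The findall, split, and strip steps above can be sketched as a single function. This is only an illustration: the name extract_url is hypothetical, and it assumes each call receives one log line as a string. It returns None when nothing matches, rather than raising.

```python
import re

def extract_url(line):
    """Return the first http:// URL whose host starts with a non-digit, or None.

    extract_url is a hypothetical helper name used only for this sketch.
    """
    hits = re.findall(r'(http://\D.*)', line)   # list of matches, possibly empty
    if not hits:
        return None
    first_token = re.split(r'\s', hits[0])[0]   # text up to the first whitespace
    return first_token.strip('"')               # drop the enclosing double quote

# The broken line from the question, split over several literals for readability.
line = ('- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 '
        '"http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" '
        '"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/55.0.2883.87 Safari/537.36" 1995')

print(extract_url(line))
```

Because http://192.168.8.1 is followed by a digit, the search skips it and picks up the URL embedded after address= instead.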

If you must use re.match(), try simplifying the pattern.

pattern = r'(.*)"(http://\D.*)"(.*)'
url_start = re.match(pattern, log_file_string)  # None if no quoted url starts with http:// plus a non-digit
url_string = url_start.group(2)
url = re.split(r"\s", url_string)
url = url[0].strip('"')

Match.groups() returns a tuple of all captured groups, whereas the snippet above used Match.group(). To use groups() instead, try:

pattern = r'(.*)"(http://\D[^"]*)"[^"]'
url_start = re.match(pattern, log_file_string)
url = url_start.groups()   # tuple of all captured groups
url = url[1]               # second group: the url without the enclosing quotes
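As for the AttributeError itself: re.match() returns None when the pattern does not match, and None has no groups() method. In the question's modified pattern, the http://[a-zA-Z] constraint sits on the first quoted field, which is the request line ("GET ... HTTP/1.1"), so no line can match. A minimal sketch, assuming the constraint is meant for the referer field, moves it to that group and checks for None before calling groups(). The names LOG_PATTERN and parse_line are hypothetical, and the tightened (.*?) and (\S+) groups are choices made for this sketch, not part of the original pattern.

```python
import re

# Groups: timestamp, request line, status, bytes sent, referer, user agent, session id.
# The http://LETTER constraint is applied to the referer group (an assumption about intent).
LOG_PATTERN = re.compile(
    r'- - \[(.*?)\] "(.*?)" (\d{3}) (\S+) "(http://[a-zA-Z][^"]*)" "(.*?)" (.*)'
)

def parse_line(line):
    """Return the tuple of captured fields, or None if the line does not match."""
    matched = LOG_PATTERN.match(line)
    if matched is None:          # re.match returns None on failure
        return None
    return matched.groups()

good_line = ('- - [01/Feb/2017:00:00:00 +0200] "GET /aikakausi/binding/1145113/image/14 HTTP/1.1" 200 658925 '
             '"http://digi.kansalliskirjasto.fi/aikakausi/binding/1145113?page=14&term=HOIKKA" '
             '"Mozilla/5.0 (Linux; Android 5.1.1; SM-J320FN Build/LMY47V; wv) AppleWebKit/537.36 '
             '(KHTML, like Gecko) Version/4.0 Chrome/49.0.2623.105 Mobile Safari/537.36 '
             '[FB_IAB/MESSENGER;FBAV/100.0.0.29.61;]" 569')
broken_line = ('- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 '
               '"http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" '
               '"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
               'Chrome/55.0.2883.87 Safari/537.36" 1995')

for line in (good_line, broken_line):
    fields = parse_line(line)
    if fields is None:
        print('skipped: referer does not start with http://LETTER')
    else:
        print(fields[4])         # the referer
```

With this arrangement, lines whose referer starts with an IP address are skipped instead of crashing the loop.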

Answer 2

If the url is all you need, you can just use split():

sample_apache_access_log_line = [

'- - [01/Feb/2017:12:34:51 +0200] "GET /aikakausi/binding/499213?page=55 HTTP/1.1" 401 1612 "http://192.168.8.1/html/home.html?url&address=http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" 1995', 
'- - [01/Feb/2017:12:34:53 +0200] "GET /aikakausi/binding/641892?term=PETSAMON&term=Petsamon&page=6 HTTP/1.1" 200 3162 "http://digi.kansalliskirjasto.fi/aikakausi/search?query=petsamo&requireAllKeywords=true&fuzzy=false&hasIllustrations=false&startDate=1918-12-30&endDate=1920-12-30&orderBy=RELEVANCE&pages=&page=5" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0" 418'
]


for i in sample_apache_access_log_line:
    if 'address=' in i:
        print(i.split('"')[3].split('address=')[1])
    else:
        print(i.split('"')[3])

# http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55
# http://digi.kansalliskirjasto.fi/aikakausi/search?query=petsamo&requireAllKeywords=true&fuzzy=false&hasIllustrations=false&startDate=1918-12-30&endDate=1920-12-30&orderBy=RELEVANCE&pages=&page=5
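As a possible variation on the split('address=') step, the standard library's urllib.parse can read the address query parameter directly. This is a sketch only: the function name referer_target is hypothetical, and it assumes the embedded url contains no unencoded & character. If it does, parse_qs cuts the value at the first &, whereas the plain split above keeps everything after address=.

```python
from urllib.parse import urlsplit, parse_qs

def referer_target(referer):
    """Return the url carried in the 'address' query parameter, else the referer itself.

    referer_target is a hypothetical helper name used only for this sketch.
    """
    query = urlsplit(referer).query            # everything after the first '?'
    address = parse_qs(query).get('address')   # list of values, or None if absent
    return address[0] if address else referer

broken = ('http://192.168.8.1/html/home.html?url&address='
          'http://digi.kansalliskirjasto.fi/aikakausi/binding/499213?page=55')
normal = 'http://digi.kansalliskirjasto.fi/aikakausi/binding/1145113?page=14&term=HOIKKA'

print(referer_target(broken))
print(referer_target(normal))
```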
