什么缺少这个正则表达式来匹配 apache 日志的行？

Question

我有这些线

5.10.80.69 - - [21/Jun/2019:15:46:20 -0700] "PATCH /niches/back-end HTTP/2.0" 406 15834
11.57.203.39 - carroll8889 [21/Jun/2019:15:46:21 -0700] "HEAD /visionary/cultivate HTTP/1.1" 404 15391
124.137.187.175 - - [21/Jun/2019:15:46:22 -0700] "DELETE /expedite/exploit/cultivate/web-enabled HTTP/1.0" 403 2606
203.36.55.39 - collins6322 [21/Jun/2019:15:46:23 -0700] "PATCH /efficient/productize/disintermediate HTTP/1.1" 504 13377
175.5.52.40 - - [21/Jun/2019:15:46:24 -0700] "POST /real-time HTTP/1.1" 200 2660
232.220.131.214 - - [21/Jun/2019:15:46:25 -0700] "GET /wireless/matrix/synergistic/expedite HTTP/1.1" 205 15081
87.234.209.125 - labadie6990 [21/Jun/2019:15:46:26 -0700] "GET /unleash/aggregate HTTP/2

我需要将它们放在这样的数组中：

example_dict = {"host":"146.204.224.152", 
                "user_name":"feest6811", 
                "time":"21/Jun/2019:15:45:24 -0700",
                "request":"POST /incentivize HTTP/1.1"}

这就是我所做的：

import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
        return logdata
    
partes = [
    r'(?P<host>\S+)',                   # host %h
    r'\S+',                             # indent %l (unused)
    r'(?P<user>\S+)',                   # user %u
    r'\[(?P<time>.+)\]',                # time %t
    r'"(?P<request>.*)"',               # request "%r"
    r'(?P<status>[0-9]+)',              # status %>s
    r'(?P<size>\S+)',                   # size %b (careful, can be '-')
    r'"(?P<referrer>.*)"',              # referrer "%{Referer}i"
    r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
]

pattern = re.compile(r'\s+'.join(partes)+r'\s*\Z')

log_data = []

for line in logs():
  log_data.append(pattern.match(line).groupdict())
    
print (log_data)

但我有这个错误：

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-029948b6e367> in <module>
     23 # Get components from each line of the log file into a structured dict
     24 for line in logs():
---> 25   log_data.append(pattern.match(line).groupdict())
     26 
     27 

AttributeError: 'NoneType' object has no attribute 'groupdict'

这个错误显然是因为正则表达式错误，但不知道为什么，代码取自这里：

https://gist.github.com/sumeetpareek/9644255

更新：

    import re
    def logs():
        with open("assets/logdata.txt", "r") as file:
            logdata = file.read()
            return logdata

regex="^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$"

log_data = []

for line in logs():
    m = pattern.match(line)
    log_data.append(re.findall(regex, line).groupdict())
    
print (log_data)

但我收到此错误：行继续字符后出现意外字符

更新 2：

将项目添加到字典时，项目必须以这种格式到达：

断言 len(logs()) == 979

one_item={'host': '146.204.224.152',
  'user_name': 'feest6811',
  'time': '21/Jun/2019:15:45:24 -0700',
  'request': 'POST /incentivize HTTP/1.1'}
assert one_item in logs(), "Sorry, this item should be in the log results, check your formating"

Answer 1

由于您的解决方案存在很多问题，请考虑彻底修改它。

应该为您工作的正则表达式是

^(?P<host>\S+) +\S+ +(?P<user>\S+) +\[(?P<time>[\w:/]+ +[+-]\d{4})] +"(?P<request>\S+) +(?P<status>\S+) +(?P<size>\S+)" +(?P<someid>\d{3}|-) +(?P<someid2>\d+|-)(?: +"(?P<referrer>[^"]*)"(?: +"(?P<agent>[^"]*)")?)?$

请参阅正则表达式演示。 注意最后一个(?: +"([^"]*)"(?: +"([^"]*)")?)? part 匹配两个可选的模式序列，最后一个仅在第一个匹配时才匹配。

您可以利用的代码可能看起来像

import re

pattern = re.compile(r'''^(?P<host>\S+) +\S+ +(?P<user>\S+) +\[(?P<time>[\w:/]+ +[+-]\d{4})] +"(?P<request>\S+) +(?P<status>\S+) +(?P<size>\S+)" +(?P<someid>\d{3}|-) +(?P<someid2>\d+|-)(?: +"(?P<referrer>[^"]*)"(?: +"(?P<agent>[^"]*)")?)?$''')

log_data = []

with open("assets/logdata.txt", "r") as file:
  for line in file:
    m = pattern.search(line.strip())
    if m:
      log_data.append(m.groupdict())

print(log_data)

请参阅 Python 演示

什么缺少这个正则表达式来匹配 apache 日志的行？

问题描述

1 个解决方案

解决方案1
1 已采纳 2021-01-25 10:04:38

什么缺少这个正则表达式来匹配 apache 日志的行？

问题描述

1 个解决方案

解决方案1 1 已采纳 2021-01-25 10:04:38

解决方案1
1 已采纳 2021-01-25 10:04:38