What is this regex missing to match lines of an Apache log?

Question

I have these lines:

5.10.80.69 - - [21/Jun/2019:15:46:20 -0700] "PATCH /niches/back-end HTTP/2.0" 406 15834
11.57.203.39 - carroll8889 [21/Jun/2019:15:46:21 -0700] "HEAD /visionary/cultivate HTTP/1.1" 404 15391
124.137.187.175 - - [21/Jun/2019:15:46:22 -0700] "DELETE /expedite/exploit/cultivate/web-enabled HTTP/1.0" 403 2606
203.36.55.39 - collins6322 [21/Jun/2019:15:46:23 -0700] "PATCH /efficient/productize/disintermediate HTTP/1.1" 504 13377
175.5.52.40 - - [21/Jun/2019:15:46:24 -0700] "POST /real-time HTTP/1.1" 200 2660
232.220.131.214 - - [21/Jun/2019:15:46:25 -0700] "GET /wireless/matrix/synergistic/expedite HTTP/1.1" 205 15081
87.234.209.125 - labadie6990 [21/Jun/2019:15:46:26 -0700] "GET /unleash/aggregate HTTP/2

and I need to put them in an array like this:

example_dict = {"host":"146.204.224.152", 
                "user_name":"feest6811", 
                "time":"21/Jun/2019:15:45:24 -0700",
                "request":"POST /incentivize HTTP/1.1"}

This is what I have done:

import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
        return logdata
    
partes = [
    r'(?P<host>\S+)',                   # host %h
    r'\S+',                             # indent %l (unused)
    r'(?P<user>\S+)',                   # user %u
    r'\[(?P<time>.+)\]',                # time %t
    r'"(?P<request>.*)"',               # request "%r"
    r'(?P<status>[0-9]+)',              # status %>s
    r'(?P<size>\S+)',                   # size %b (careful, can be '-')
    r'"(?P<referrer>.*)"',              # referrer "%{Referer}i"
    r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
]

pattern = re.compile(r'\s+'.join(partes)+r'\s*\Z')

log_data = []

for line in logs():
  log_data.append(pattern.match(line).groupdict())
    
print (log_data)

But I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-029948b6e367> in <module>
     23 # Get components from each line of the log file into a structured dict
     24 for line in logs():
---> 25   log_data.append(pattern.match(line).groupdict())
     26 
     27 

AttributeError: 'NoneType' object has no attribute 'groupdict'

This error is obviously because the regex is wrong, but I am not sure why. The code is taken from here:

https://gist.github.com/sumeetpareek/9644255

Update:

import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
        return logdata

regex="^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$"

log_data = []

for line in logs():
    m = pattern.match(line)
    log_data.append(re.findall(regex, line).groupdict())
    
print (log_data)

But I get this error: unexpected character after line continuation character

Update 2:

When adding the items to a dictionary, the items must be in this format:

assert len(logs()) == 979

one_item={'host': '146.204.224.152',
  'user_name': 'feest6811',
  'time': '21/Jun/2019:15:45:24 -0700',
  'request': 'POST /incentivize HTTP/1.1'}
assert one_item in logs(), "Sorry, this item should be in the log results, check your formating"

Answer 1

Since there are a lot of issues with the solution you have, please consider revamping it completely.
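To make two of those issues concrete, here is a minimal sketch. The variable names (text, first_items, lines, match_on_line) are illustrative and not from the original post; the two sample lines are copied from the question. Because logs() returns the result of file.read(), which is a single string, the loop "for line in logs()" walks over single characters rather than lines. The pattern from the question also requires quoted referrer and user-agent fields, which these log lines do not contain, so match() returns None even for a whole line. Separately, the syntax error in the update comes from putting unescaped double quotes inside a double-quoted string literal: the string ends early and the following backslash is read as a line continuation character.

```python
import re

# Two sample lines from the question, joined the way file.read() would return them.
text = (
    '5.10.80.69 - - [21/Jun/2019:15:46:20 -0700] "PATCH /niches/back-end HTTP/2.0" 406 15834\n'
    '175.5.52.40 - - [21/Jun/2019:15:46:24 -0700] "POST /real-time HTTP/1.1" 200 2660\n'
)

# Iterating over a string yields single characters, not lines.
first_items = [item for item in text][:3]

# splitlines() yields the actual lines.
lines = text.splitlines()

# The pattern from the question: referrer and agent are mandatory here.
partes = [
    r'(?P<host>\S+)', r'\S+', r'(?P<user>\S+)', r'\[(?P<time>.+)\]',
    r'"(?P<request>.*)"', r'(?P<status>[0-9]+)', r'(?P<size>\S+)',
    r'"(?P<referrer>.*)"', r'"(?P<agent>.*)"',
]
pattern = re.compile(r'\s+'.join(partes) + r'\s*\Z')

# Even a complete line does not match, so calling .groupdict() on the
# result would raise the AttributeError shown in the question.
match_on_line = pattern.match(lines[0])
```

Checking the match result before calling groupdict(), as the code further down does with "if m:", avoids the AttributeError on lines that do not match.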

A regex that should work for you is

^(?P<host>\S+) +\S+ +(?P<user>\S+) +\[(?P<time>[\w:/]+ +[+-]\d{4})] +"(?P<request>\S+) +(?P<status>\S+) +(?P<size>\S+)" +(?P<someid>\d{3}|-) +(?P<someid2>\d+|-)(?: +"(?P<referrer>[^"]*)"(?: +"(?P<agent>[^"]*)")?)?$

See the regex demo. Note that the last (?: +"([^"]*)"(?: +"([^"]*)")?)? part matches two optional sequences of patterns, and the second one is only matched if the first is matched.
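The behaviour of that nested optional part can be checked in isolation. The sketch below is only an illustration: the name tail, the reduced pattern (just the size field plus the optional tail) and the input strings are made up for the demonstration and are not part of the answer's code.

```python
import re

# Reduced pattern: a size field followed by the nested optional tail.
tail = re.compile(
    r'^(?P<size>\d+|-)(?: +"(?P<referrer>[^"]*)"(?: +"(?P<agent>[^"]*)")?)?$'
)

# Neither optional field present: both groups stay None.
no_extras = tail.search('15834')

# Only the first optional field present.
referrer_only = tail.search('15834 "http://example.com/"')

# Both optional fields present.
both = tail.search('15834 "http://example.com/" "Mozilla/5.0"')

# A third quoted field is not allowed by the pattern, so there is no match.
three_fields = tail.search('15834 "a" "b" "c"')
```

Because the agent group sits inside the referrer group, a line can carry a referrer without an agent, but never an agent without a referrer.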

The code you can use may look like this:

import re

pattern = re.compile(r'''^(?P<host>\S+) +\S+ +(?P<user>\S+) +\[(?P<time>[\w:/]+ +[+-]\d{4})] +"(?P<request>\S+) +(?P<status>\S+) +(?P<size>\S+)" +(?P<someid>\d{3}|-) +(?P<someid2>\d+|-)(?: +"(?P<referrer>[^"]*)"(?: +"(?P<agent>[^"]*)")?)?$''')

log_data = []

with open("assets/logdata.txt", "r") as file:
  for line in file:
    m = pattern.search(line.strip())
    if m:
      log_data.append(m.groupdict())

print(log_data)

See the Python demo.
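Note that the group names in the pattern above do not line up with the keys the question asks for (host, user_name, time, request): the group named request captures only the HTTP method, and there is no user_name group. If the goal is exactly the dictionary shape from the question, a variant along the following lines may be closer. This is a sketch under assumptions: LINE_RE and parse_lines are hypothetical names, the pattern keeps only the four requested fields, and a missing user is left as "-" because the question does not say how it should be represented.

```python
import re

# Assumed variant: group names chosen to match the keys in the question's example.
LINE_RE = re.compile(
    r'^(?P<host>\S+) +\S+ +(?P<user_name>\S+) +'
    r'\[(?P<time>[^\]]+)\] +"(?P<request>[^"]*)"'
)

def parse_lines(lines):
    """Return one dict per matching line; lines that do not match are skipped."""
    result = []
    for line in lines:
        m = LINE_RE.search(line.strip())
        if m:
            result.append(m.groupdict())
    return result

def logs(path="assets/logdata.txt"):
    # A file object iterates line by line, unlike the string from file.read().
    with open(path, "r") as file:
        return parse_lines(file)

# Two sample lines copied from the question.
sample = [
    '11.57.203.39 - carroll8889 [21/Jun/2019:15:46:21 -0700] "HEAD /visionary/cultivate HTTP/1.1" 404 15391',
    '175.5.52.40 - - [21/Jun/2019:15:46:24 -0700] "POST /real-time HTTP/1.1" 200 2660',
]
parsed = parse_lines(sample)
```

With this shape, logs() returns a list of dictionaries, so checks such as len(logs()) and "one_item in logs()" from the question operate on parsed entries rather than on raw text. Lines that are cut off before the closing quote of the request are skipped by this sketch.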

Answer 1 was accepted on 2021-01-25 10:04:38.
