What is missing from this regex to match the lines of Apache logs?
I have these lines:
5.10.80.69 - - [21/Jun/2019:15:46:20 -0700] "PATCH /niches/back-end HTTP/2.0" 406 15834
11.57.203.39 - carroll8889 [21/Jun/2019:15:46:21 -0700] "HEAD /visionary/cultivate HTTP/1.1" 404 15391
124.137.187.175 - - [21/Jun/2019:15:46:22 -0700] "DELETE /expedite/exploit/cultivate/web-enabled HTTP/1.0" 403 2606
203.36.55.39 - collins6322 [21/Jun/2019:15:46:23 -0700] "PATCH /efficient/productize/disintermediate HTTP/1.1" 504 13377
175.5.52.40 - - [21/Jun/2019:15:46:24 -0700] "POST /real-time HTTP/1.1" 200 2660
232.220.131.214 - - [21/Jun/2019:15:46:25 -0700] "GET /wireless/matrix/synergistic/expedite HTTP/1.1" 205 15081
87.234.209.125 - labadie6990 [21/Jun/2019:15:46:26 -0700] "GET /unleash/aggregate HTTP/2
I need to put them into an array of dictionaries like this:
example_dict = {"host":"146.204.224.152",
"user_name":"feest6811",
"time":"21/Jun/2019:15:45:24 -0700",
"request":"POST /incentivize HTTP/1.1"}
This is what I did:
import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    return logdata
partes = [
    r'(?P<host>\S+)', # host %h
    r'\S+', # indent %l (unused)
    r'(?P<user>\S+)', # user %u
    r'\[(?P<time>.+)\]', # time %t
    r'"(?P<request>.*)"', # request "%r"
    r'(?P<status>[0-9]+)', # status %>s
    r'(?P<size>\S+)', # size %b (careful, can be '-')
    r'"(?P<referrer>.*)"', # referrer "%{Referer}i"
    r'"(?P<agent>.*)"', # user agent "%{User-agent}i"
]
pattern = re.compile(r'\s+'.join(partes)+r'\s*\Z')
log_data = []
for line in logs():
    log_data.append(pattern.match(line).groupdict())
print (log_data)
But I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-2-029948b6e367> in <module>
23 # Get components from each line of the log file into a structured dict
24 for line in logs():
---> 25 log_data.append(pattern.match(line).groupdict())
26
27
AttributeError: 'NoneType' object has no attribute 'groupdict'
The error is obviously caused by the regex being wrong, but I don't know why. The code was taken from here:
https://gist.github.com/sumeetpareek/9644255
Update:
import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    return logdata
regex="^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$"
log_data = []
for line in logs():
    m = pattern.match(line)
    log_data.append(re.findall(regex, line).groupdict())
print (log_data)
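But I get this error: unexpected character after line continuation character
Update 2:
When the items are added to the dictionary, they must come out in this format:
assert len(logs()) == 979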
one_item={'host': '146.204.224.152',
'user_name': 'feest6811',
'time': '21/Jun/2019:15:45:24 -0700',
'request': 'POST /incentivize HTTP/1.1'}
assert one_item in logs(), "Sorry, this item should be in the log results, check your formating"
Since your solution has quite a few problems, consider reworking it completely.
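To make those problems concrete, here is a minimal sketch. The two sample lines are copied from the question; the variable names are mine and purely illustrative. In the first attempt, logs() returns one big string, so the for loop walks over single characters rather than lines, and the pattern insists on quoted referrer and agent fields that these log lines do not have, so pattern.match() returns None. The second attempt fails even earlier, because the double quotes inside the regex terminate the "..." string literal; a raw triple-quoted string avoids that.

```python
import re

# Two sample lines from the question, joined the way file.read() would return them.
logdata = (
    '5.10.80.69 - - [21/Jun/2019:15:46:20 -0700] "PATCH /niches/back-end HTTP/2.0" 406 15834\n'
    '175.5.52.40 - - [21/Jun/2019:15:46:24 -0700] "POST /real-time HTTP/1.1" 200 2660\n'
)

# Problem 1: iterating over a str yields characters, not lines.
first_items = [item for item in logdata][:3]
print(first_items)  # ['5', '.', '1']

# splitlines() (or iterating over the file object itself) yields whole lines.
lines = logdata.splitlines()
print(len(lines))  # 2

# Problem 2: the original pattern requires quoted referrer and agent fields,
# which these lines lack, so match() returns None.
strict = re.compile(
    r'(?P<host>\S+)\s+\S+\s+(?P<user>\S+)\s+\[(?P<time>.+)\]\s+"(?P<request>.*)"'
    r'\s+(?P<status>[0-9]+)\s+(?P<size>\S+)\s+"(?P<referrer>.*)"\s+"(?P<agent>.*)"\s*\Z'
)
print(strict.match(lines[0]))  # None

# Problem 3: a pattern that contains double quotes cannot sit inside a plain
# "..." literal; a raw triple-quoted string keeps the quotes intact.
quoted = re.compile(r'''"(?P<request>[^"]*)"''')
print(quoted.search(lines[0]).group("request"))  # PATCH /niches/back-end HTTP/2.0
```

Calling .groupdict() on the None from problem 2 is what raises the AttributeError shown in the question, which is why it is safer to test the match object before using it.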
A regex that should work for you is
^(?P<host>\S+) +\S+ +(?P<user>\S+) +\[(?P<time>[\w:/]+ +[+-]\d{4})] +"(?P<request>\S+) +(?P<status>\S+) +(?P<size>\S+)" +(?P<someid>\d{3}|-) +(?P<someid2>\d+|-)(?: +"(?P<referrer>[^"]*)"(?: +"(?P<agent>[^"]*)")?)?$
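See the regex demo. Note that the final (?: +"([^"]*)"(?: +"([^"]*)")?)? part matches two optional pattern sequences, and the second one can only match if the first one has matched.
The code you can use may look like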
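As a small illustration of that nested optional tail, the sketch below isolates it behind a leading \d+ and runs it against three made-up strings (the referrer and agent values are hypothetical, not taken from the log file). It shows which named groups end up filled in each case.

```python
import re

# Only the optional tail of the pattern, with a leading number to attach it to.
tail = re.compile(r'''^\d+(?: +"(?P<referrer>[^"]*)"(?: +"(?P<agent>[^"]*)")?)?$''')

# Hypothetical inputs: no quoted field, one quoted field, two quoted fields.
samples = [
    '15834',
    '15834 "http://example.com/"',
    '15834 "http://example.com/" "curl/7.0"',
]

results = [tail.search(s).groupdict() for s in samples]
for r in results:
    print(r)
# {'referrer': None, 'agent': None}
# {'referrer': 'http://example.com/', 'agent': None}
# {'referrer': 'http://example.com/', 'agent': 'curl/7.0'}
```

Because the agent group is nested inside the referrer group, a single quoted field is always captured as the referrer, never as the agent.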
import re
pattern = re.compile(r'''^(?P<host>\S+) +\S+ +(?P<user>\S+) +\[(?P<time>[\w:/]+ +[+-]\d{4})] +"(?P<request>\S+) +(?P<status>\S+) +(?P<size>\S+)" +(?P<someid>\d{3}|-) +(?P<someid2>\d+|-)(?: +"(?P<referrer>[^"]*)"(?: +"(?P<agent>[^"]*)")?)?$''')
log_data = []
with open("assets/logdata.txt", "r") as file:
    for line in file:
        m = pattern.search(line.strip())
        if m:
            log_data.append(m.groupdict())
print(log_data)
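Note that in the pattern above the group named request captures only the HTTP method, while the asserts in Update 2 expect the keys host, user_name, time and request, with request holding the whole quoted string. A minimal sketch of a variant that produces that shape could look as follows. The simplified pattern and the names LINE_RE, parse_lines and parse_logs are my own assumptions rather than part of the pattern above, and the file is assumed to have the format shown in the question.

```python
import re

# Simplified variant (an assumption): capture only the four fields the asserts
# ask for, keeping the whole quoted request string as one group.
LINE_RE = re.compile(
    r'''^(?P<host>\S+) +\S+ +(?P<user_name>\S+) +\[(?P<time>[^\]]+)\] +"(?P<request>[^"]*)"'''
)

def parse_lines(lines):
    """Return one dict per line that matches; lines that do not match are skipped."""
    result = []
    for line in lines:
        m = LINE_RE.search(line.strip())
        if m:
            result.append(m.groupdict())
    return result

def parse_logs(path="assets/logdata.txt"):
    """Hypothetical helper: parse the log file at `path` line by line."""
    with open(path, "r") as file:
        return parse_lines(file)

# One line from the question plus one line that is not a log entry.
sample = [
    '11.57.203.39 - carroll8889 [21/Jun/2019:15:46:21 -0700] "HEAD /visionary/cultivate HTTP/1.1" 404 15391',
    'not a log line',
]
parsed = parse_lines(sample)
print(parsed)
```

Lines without a user keep '-' as user_name here; whether that is what the asserts expect is something to check against the full file.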