Using a regular expression to extract a file name from a logfile using Python

I'm looking to create a list of the files accessed, taken from a log file. Two examples of strings from the file are shown below.

.... [08/Mar/2020:19:11:15 -0700] "GET /socview/15Ragged.htm HTTP/1.1" 200 6564 .........

.... [08/Mar/2020:19:11:31 -0700] "GET /socview/?C=D;O=A HTTP/1.1" 200 13443 ..............

/socview/15Ragged.htm is what I'm looking to extract, i.e. a path ending in .htm, .log, .txt, etc.

/socview/?C=D;O=A is what I'm trying to avoid extracting.

It seems that the "." is what's causing issues: when I run the code without searching for it, i.e. with the RE below, it runs perfectly as part of the loop shown at the bottom of this post.

unique = re.search(r'GET (\S+)', x)

However, it extracts strings I do not want. Below are the loop and the RE that I'm trying to use. It makes sense to me and I can't figure out what's wrong; when it is run, the message below is displayed. Any help would be greatly appreciated.

    if unique.group(1) not in unilist:

AttributeError: 'NoneType' object has no attribute 'group'

for x in input:
    unique = re.search(r'GET (\S+\.\S+)', x)

    if unique.group(1) not in unilist:
        unilist.append(unique.group(1))

The pattern GET (\S+\.\S+) is fine. The problem is that re.search() returns None if the match fails, so for the second string you provided, unique is None, which does not have a group attribute.
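To see this behaviour in isolation, here is a minimal sketch that applies the pattern to the two sample lines from the question. The variable names with_dot and without_dot are made up for illustration, and the leading "...." is the elided prefix exactly as it appears in the question.

```python
import re

# Sample lines taken from the question; "...." stands for the elided fields.
with_dot = '.... [08/Mar/2020:19:11:15 -0700] "GET /socview/15Ragged.htm HTTP/1.1" 200 6564'
without_dot = '.... [08/Mar/2020:19:11:31 -0700] "GET /socview/?C=D;O=A HTTP/1.1" 200 13443'

pattern = r'GET (\S+\.\S+)'

first = re.search(pattern, with_dot)
second = re.search(pattern, without_dot)

print(first.group(1))  # /socview/15Ragged.htm
print(second)          # None, because the request path contains no "."
```

Calling second.group(1) here would raise the same AttributeError shown in the question.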

Try the following:

for x in input:
    unique = re.search(r'GET (\S+\.\S+)', x)

    if unique is None:
        continue

    if unique.group(1) not in unilist:
        unilist.append(unique.group(1))
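To check the fix end to end, the following sketch runs the same loop over the sample lines from the question, with the first line repeated to exercise the duplicate check. The name log_lines is an assumption standing in for however the file is actually read; it replaces input here.

```python
import re

# Assumed stand-in for the lines read from the log file.
log_lines = [
    '.... [08/Mar/2020:19:11:15 -0700] "GET /socview/15Ragged.htm HTTP/1.1" 200 6564',
    '.... [08/Mar/2020:19:11:31 -0700] "GET /socview/?C=D;O=A HTTP/1.1" 200 13443',
    '.... [08/Mar/2020:19:11:15 -0700] "GET /socview/15Ragged.htm HTTP/1.1" 200 6564',
]

unilist = []
for x in log_lines:
    unique = re.search(r'GET (\S+\.\S+)', x)

    # Lines whose request path contains no "." do not match; skip them.
    if unique is None:
        continue

    if unique.group(1) not in unilist:
        unilist.append(unique.group(1))

print(unilist)  # ['/socview/15Ragged.htm']
```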

I recommend using better variable names. For example, input is a built-in in Python, so avoid shadowing it. If the loop body grows, names like x will be hard to follow.

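Also, I recommend pre-compiling the regex like this. Python's re module caches recently compiled patterns, so re.search() does not recompile the pattern on every iteration, but compiling once up front avoids the per-call cache lookup and gives the pattern a descriptive name: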

matcher = re.compile(r'GET (\S+\.\S+)')

for line in lines:
    match = matcher.search(line)
    # your loop body here
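Putting the suggestions together, one possible shape for the whole thing is sketched below. The names REQUEST_PATH and unique_paths are hypothetical, chosen only for this example; the pattern itself is the one from the question.

```python
import re

# Compiled once and reused for every line.
REQUEST_PATH = re.compile(r'GET (\S+\.\S+)')


def unique_paths(lines):
    """Return the distinct request paths containing a ".", in first-seen order."""
    paths = []
    for line in lines:
        match = REQUEST_PATH.search(line)
        if match is None:
            continue  # no "GET <path with a dot>" on this line
        path = match.group(1)
        if path not in paths:
            paths.append(path)
    return paths


sample = [
    '.... [08/Mar/2020:19:11:15 -0700] "GET /socview/15Ragged.htm HTTP/1.1" 200 6564',
    '.... [08/Mar/2020:19:11:31 -0700] "GET /socview/?C=D;O=A HTTP/1.1" 200 13443',
]
print(unique_paths(sample))  # ['/socview/15Ragged.htm']
```

Because the function only iterates over its argument, an open file object can be passed in directly instead of a list, so the log does not have to be loaded into memory first.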
