Using a regular expression to extract file names from a log file in Python

Question

I'm looking to create a list of the files accessed, based on a log file. Two example lines from the file are shown below.

.... [08/Mar/2020:19:11:15 -0700] "GET /socview/15Ragged.htm HTTP/1.1" 200 6564 .........

.... [08/Mar/2020:19:11:31 -0700] "GET /socview/?C=D;O=A HTTP/1.1" 200 13443 ..............

/socview/15Ragged.htm is what I'm looking to extract, i.e. a path ending in .htm, .log, .txt, etc.

/socview/?C=D;O=A is what I'm trying to avoid extracting.

It seems that the "." is what's causing the issue: when I run the code without searching for it, i.e. with the RE below, it runs perfectly as part of the loop shown at the bottom of this post.

unique = re.search(r'GET (\S+)', x)

However, it extracts strings I do not want. Below are the loop and RE that I'm trying to use. They make sense to me and I can't figure out what's wrong; when run, the message below is displayed. Any help would be greatly appreciated.

"if unique.group(1) not in unilist:

AttributeError: 'NoneType' object has no attribute 'group'"

for x in input:
     unique = re.search(r'GET (\S+\.\S+)', x)

     if unique.group(1) not in unilist:
           unilist.append(unique.group(1))

Answer 1

The pattern GET (\S+\.\S+) is fine. The problem is that re.search() returns None when the match fails, so for the second string you provided, unique is None, which has no group method.

Try the following:

for x in input:
    unique = re.search(r'GET (\S+\.\S+)', x)

    if unique is None:
        continue

    if unique.group(1) not in unilist:
        unilist.append(unique.group(1))
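
To see why the guard is needed, here is a small check run against the two sample lines from the question. The variable names are made up for illustration, and the dots at either end stand in for the elided parts of each log line.

```python
import re

# The two sample lines from the question; the dots stand in for elided fields.
with_dot = '.... [08/Mar/2020:19:11:15 -0700] "GET /socview/15Ragged.htm HTTP/1.1" 200 6564 ....'
without_dot = '.... [08/Mar/2020:19:11:31 -0700] "GET /socview/?C=D;O=A HTTP/1.1" 200 13443 ....'

pattern = r'GET (\S+\.\S+)'

first = re.search(pattern, with_dot)      # a match object
second = re.search(pattern, without_dot)  # None: the requested path contains no "."

print(first.group(1))  # /socview/15Ragged.htm
print(second)          # None
```

Calling second.group(1) here would raise the same AttributeError as in the question, which is what the "is None" check avoids.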

I recommend using better variable names. For example, input is a built-in in Python; avoid shadowing it. If the loop body grows, names like x will be hard to follow.

Also, I recommend pre-compiling the regex like this. The re module caches recently used patterns, so the saving per iteration is small, but compiling once avoids the repeated cache lookup and keeps the pattern in one place:

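import re

matcher = re.compile(r"GET (\S+\.\S+)")  # compiled once, outside the loop

for line in lines:
    match = matcher.search(line)
    if match is not None and match.group(1) not in unilist:
        unilist.append(match.group(1))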
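
Putting the suggestions together, a self-contained sketch might look like the following. The names REQUEST_PATH, extract_unique_paths and sample_lines are hypothetical, and the set used for membership checks is a small addition that is not in the original answer. It keeps lookups cheap as the list grows.

```python
import re

# Hypothetical name; the pattern itself is the one from the question.
REQUEST_PATH = re.compile(r'GET (\S+\.\S+)')


def extract_unique_paths(lines):
    """Return requested paths that contain a dot, in first-seen order."""
    seen = set()   # fast membership test (assumption: not in the original answer)
    paths = []     # preserves the order in which paths first appear
    for line in lines:
        match = REQUEST_PATH.search(line)
        if match is None:
            continue  # e.g. directory listings such as /socview/?C=D;O=A
        path = match.group(1)
        if path not in seen:
            seen.add(path)
            paths.append(path)
    return paths


# Sample lines based on the question; the first one is repeated on purpose.
sample_lines = [
    '.... [08/Mar/2020:19:11:15 -0700] "GET /socview/15Ragged.htm HTTP/1.1" 200 6564 ....',
    '.... [08/Mar/2020:19:11:31 -0700] "GET /socview/?C=D;O=A HTTP/1.1" 200 13443 ....',
    '.... [08/Mar/2020:19:11:15 -0700] "GET /socview/15Ragged.htm HTTP/1.1" 200 6564 ....',
]

print(extract_unique_paths(sample_lines))  # ['/socview/15Ragged.htm']
```

Note that this pattern accepts any requested path containing a dot. If only certain extensions should count, the pattern could be tightened, for example to r'GET (\S+\.(?:htm|log|txt))\s'; that extension list is a guess based on the question.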

Answer 1
0 accepted 2020-03-31 17:29:41
