Analysing a text file in Python
I have a text file that needs to be analysed. Each line in the file is of this form:
7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj@nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3
I need to skip the timestamp and the (slbfd) and only keep a count of the lines with IN and OUT. Further, depending on the name in quotes, I need to increase a count for that name if the line is an OUT line and decrease the count otherwise. How would I go about doing this in Python?
The other answers, using a regex or splitting the line, will get the job done, but if you want a fully maintainable solution that will grow with you, you should build a grammar. I love pyparsing for this:
S = '''
7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj@nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3'''

from pyparsing import *
from collections import defaultdict

# Define the grammar
num = Word(nums)
marker = Literal(":").suppress()
timestamp = Group(num + marker + num + marker + num)
label = Literal("(slbfd)")
flag = Word(alphas)("flag") + marker
name = QuotedString(quoteChar='"')("name")
line = timestamp + label + flag + name + restOfLine
grammar = OneOrMore(Group(line))

# Now parsing is a piece of cake!
P = grammar.parseString(S)
counts = defaultdict(int)
for x in P:
    # IN adds 1 and OUT subtracts 1 here; swap the signs to match the question's convention
    if x.flag == "IN": counts[x.name] += 1
    if x.flag == "OUT": counts[x.name] -= 1

for key in counts:
    print(key, counts[key])
This gives as output:
lq_viz_server 1
OFM32 -1
This would look more impressive if your sample log file were longer. The beauty of a pyparsing solution is the ability to adapt to a more complex query in the future (e.g. grab and parse the timestamp, pull the email address, parse error codes...). The idea is that you write the grammar independently of the query: you simply convert the raw text to a computer-friendly format, abstracting the parsing implementation away from its usage.
You have two options:
1. Use the .split() function of the string (as pointed out in the comments).
2. Use the re module for regular expressions.
I would suggest using the re module and creating a pattern with named groups.
Recipe:
1. First create a pattern with re.compile() containing named groups.
2. Do a for loop over the file to get the lines.
3. Use .match() of the created pattern object on each line.
4. Use .groupdict() of the resulting match object to read the named values.
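The recipe above can be sketched roughly as follows. The pattern, the group names (time, flag, name) and the helper count_checkouts are illustrative assumptions, not part of the original answer. The sign convention follows one reading of the question: OUT adds 1 and IN subtracts 1.

```python
import re
from collections import defaultdict

# Named groups (hypothetical names): the timestamp, the flag
# (IN, OUT, UNSUPPORTED, ...) and the name in double quotes.
LINE_RE = re.compile(
    r'(?P<time>\d+:\d+:\d+) \(slbfd\) (?P<flag>\w+): "(?P<name>[^"]+)"'
)

def count_checkouts(lines):
    """Return {name: count}; OUT adds 1 and IN subtracts 1."""
    counts = defaultdict(int)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # the line does not have the expected shape
        fields = match.groupdict()
        if fields["flag"] == "OUT":
            counts[fields["name"]] += 1
        elif fields["flag"] == "IN":
            counts[fields["name"]] -= 1
        # any other flag (e.g. UNSUPPORTED) is ignored
    return dict(counts)

sample = [
    '7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1',
    '7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) '
    'Albahraj@nabwmps3 (License server system does not support this feature. (-18,327))',
    '7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3',
]
print(count_checkouts(sample))
```

Since count_checkouts only iterates over lines, an open file object can be passed in place of the sample list.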
Assuming the file is divided into lines (I don't know if that's true), you have to apply the split() function to each line. You will get this (note that the name keeps its double quotes):
['7:06:32', '(slbfd)', 'IN:', '"lq_viz_server"', 'aqeela@nabltas1']
From there you should be able to apply whatever logic you need by comparing these values.
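As a minimal sketch of the split() approach: the helper name parse_line is hypothetical, and it assumes the quoted name contains no spaces, since split() would otherwise break the name into several tokens.

```python
def parse_line(line):
    """Split one log line into (flag, name), or return None if it is too short."""
    parts = line.split()
    if len(parts) < 4:
        return None
    flag = parts[2].rstrip(":")   # 'IN:' -> 'IN'
    name = parts[3].strip('"')    # '"lq_viz_server"' -> 'lq_viz_server'
    return flag, name

print(parse_line('7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1'))
```

The returned flag can then be compared against "IN" and "OUT" to update a per-name counter.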
I made some wild assumptions about your specification; here is some sample code to help you get started:
objects = {}
with open("data.txt") as data:
    for line in data:
        if "IN:" in line or "OUT:" in line:
            try:
                name = line.split("\"")[1]
            except IndexError:
                print("No double quoted name on line: {}".format(line))
                name = "PARSING_ERRORS"
            if "OUT:" in line:
                diff = 1
            else:
                diff = -1
            try:
                objects[name] += diff
            except KeyError:
                objects[name] = diff
print(objects)  # for debug only, not advisable to print huge number of names
In the mode of just get 'er done with the standard distribution, this works:
import re
from collections import Counter

count = Counter()
# open your file as inF ("data.txt" is a placeholder file name)
with open("data.txt") as inF:
    for line in inF:
        match = re.match(r'\d+:\d+:\d+ \(slbfd\) (\w+): "(\w+)"', line)
        if match:
            if match.group(1) == 'IN': count[match.group(2)] += 1
            elif match.group(1) == 'OUT': count[match.group(2)] -= 1
print(count)
Prints:
Counter({'lq_viz_server': 1, 'OFM32': -1})