Analysing a text file in Python

I have a text file that needs to be analysed. Each line in the file is of this form:

7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1  

7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS   ) Albahraj@nabwmps3  (License server system does not support this feature. (-18,327))

7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3

I need to skip the timestamp and the (slbfd) and only keep a count of the lines with IN and OUT. Further, depending on the name in quotes, I need to increase a per-name count if the line is an OUT line and decrease it otherwise. How would I go about doing this in Python?

The other answers with regex and splitting the line will get the job done, but if you want a fully maintainable solution that will grow with you, you should build a grammar. I love pyparsing for this:

S ='''
7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1  
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS   ) Albahraj@nabwmps3  (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3'''

from pyparsing import *
from collections import defaultdict

# Define the grammar
num = Word(nums)
marker = Literal(":").suppress()
timestamp = Group(num + marker + num + marker + num)
label = Literal("(slbfd)")
flag = Word(alphas)("flag") + marker
name = QuotedString(quoteChar='"')("name")

line    = timestamp + label + flag + name + restOfLine
grammar = OneOrMore(Group(line))

# Now parsing is a piece of cake!  
P = grammar.parseString(S)
counts = defaultdict(int)

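# Lines with any other flag (e.g. UNSUPPORTED) are ignored.
# Note: this counts IN as +1 and OUT as -1; swap the signs if you want
# OUT to increase the count, as described in the question.
for x in P:
    if x.flag=="IN": counts[x.name] += 1
    if x.flag=="OUT": counts[x.name] -= 1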

for key in counts:
    print(key, counts[key])

This gives the output:

lq_viz_server 1
OFM32 -1

Which would look more impressive if your sample log file was longer. The beauty of a pyparsing solution is the ability to adapt to a more complex query in the future (e.g. grab and parse the timestamp, pull the email address, parse error codes...). The idea is that you write the grammar independently of the query: you simply convert the raw text to a computer-friendly format, abstracting the parsing implementation away from its usage.

You have two options:

  1. Use the .split() function of the string (as pointed out in the comments)
  2. Use the re module for regular expressions.

I would suggest using the re module and creating a pattern with named groups.

Recipe:

  • first create a pattern with re.compile() containing named groups
  • do a for loop over the file to get the lines
  • use .match() of the created pattern object on each line
  • use .groupdict() of the returned match object to access your values of interest
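The recipe above can be sketched roughly as follows. This is only a minimal sketch: the names LINE_RE and count_features are made up for illustration, the pattern assumes every line looks like the samples in the question, and the sign convention (OUT adds one, IN subtracts one) follows the question's wording rather than anything the log format dictates.

```python
import re
from collections import Counter

# Named groups: "flag" is the event type, "name" is the quoted feature name.
# The pattern is an assumption based on the three sample lines in the question.
LINE_RE = re.compile(
    r'\s*\d+:\d+:\d+\s+\(slbfd\)\s+(?P<flag>\w+):\s+"(?P<name>[^"]+)"'
)

def count_features(lines):
    """Return per-name counts: +1 for each OUT line, -1 for each IN line."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a log line in the expected format
        fields = match.groupdict()
        if fields["flag"] == "OUT":
            counts[fields["name"]] += 1
        elif fields["flag"] == "IN":
            counts[fields["name"]] -= 1
        # any other flag (e.g. UNSUPPORTED) is skipped
    return counts

sample = [
    '7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1  ',
    '7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS   ) '
    'Albahraj@nabwmps3  (License server system does not support this feature. (-18,327))',
    '7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3',
]
print(count_features(sample))
```

To run it on a real log, pass an open file object instead of the sample list, since iterating over a file yields its lines.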

Assuming the file is divided into lines (I don't know whether that's true), you can apply the split() function to each line. You will get this (note that the quotes stay attached to the name):

["7:06:32", "(slbfd)", "IN:", "\"lq_viz_server\"", "aqeela@nabltas1"]

From there you should be able to apply whatever logic you need by comparing the values.
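As a rough illustration of the split() approach, here is a small sketch. The helper name parse_line is hypothetical, and it assumes the quoted name never contains spaces (true for the sample lines, but not guaranteed by the question); a name with spaces would be cut at the first space.

```python
def parse_line(line):
    """Split a log line on whitespace and return (flag, name), or None."""
    parts = line.split()
    # Expected layout: [timestamp, "(slbfd)", "IN:" or "OUT:", '"name"', ...]
    if len(parts) < 4 or parts[1] != "(slbfd)":
        return None
    flag = parts[2].rstrip(":")
    if flag not in ("IN", "OUT"):
        return None  # e.g. UNSUPPORTED lines are skipped
    name = parts[3].strip('"')  # drop the surrounding double quotes
    return flag, name

print(parse_line('7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1  '))
```

The returned pair can then drive whatever counting logic you need, for example adding or subtracting one in a dictionary keyed by name.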

I made some wild assumptions about your specification; here is some sample code to help you start:

objects = {}
with open("data.txt") as data:
    for line in data:
        if "IN:" in line or "OUT:" in line:
            try:
                name = line.split("\"")[1]
            except IndexError:
                print("No double quoted name on line: {}".format(line))
                name = "PARSING_ERRORS"
            if "OUT:" in line:
                diff = 1
            else:
                diff = -1
            try:
                objects[name] += diff
            except KeyError:
                objects[name] = diff
print(objects) # for debug only, not advisable to print huge number of names

In the mode of "just get 'er done" with the standard library, this works:

import re
from collections import Counter
# open your file as inF...
count=Counter()
for line in inF:
    match=re.match(r'\d+:\d+:\d+ \(slbfd\) (\w+): "(\w+)"', line)
    if match:
        if match.group(1) == 'IN': count[match.group(2)]+=1
        elif match.group(1) == 'OUT': count[match.group(2)]-=1

print(count)

Prints:

Counter({'lq_viz_server': 1, 'OFM32': -1})
