Analysing a text file in Python

Question

I have a text file that needs to be analysed. Each line in the file is of this form:

7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1  

7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS   ) Albahraj@nabwmps3  (License server system does not support this feature. (-18,327))

7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3

I need to skip the timestamp and the (slbfd) and only keep a count of the lines with IN and OUT. Further, depending on the name in quotes, I need to increase a per-name count if the line is an OUT line and decrease it otherwise. How would I go about doing this in Python?

Answer 1

The other answers with regex and splitting the line will get the job done, but if you want a fully maintainable solution that will grow with you, you should build a grammar. I love pyparsing for this:

S ='''
7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1  
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS   ) Albahraj@nabwmps3  (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3'''

from pyparsing import *
from collections import defaultdict

# Define the grammar
num = Word(nums)
marker = Literal(":").suppress()
timestamp = Group(num + marker + num + marker + num)
label = Literal("(slbfd)")
flag = Word(alphas)("flag") + marker
name = QuotedString(quoteChar='"')("name")

line    = timestamp + label + flag + name + restOfLine
grammar = OneOrMore(Group(line))

# Now parsing is a piece of cake!  
P = grammar.parseString(S)
counts = defaultdict(int)

for x in P:
    if x.flag=="IN": counts[x.name] += 1
    if x.flag=="OUT": counts[x.name] -= 1

for key in counts:
    print(key, counts[key])

This gives the output:

lq_viz_server 1
OFM32 -1

This would look more impressive if your sample log file were longer. The beauty of a pyparsing solution is the ability to adapt to a more complex query in the future (e.g. grab and parse the timestamp, pull the email address, parse error codes...). The idea is that you write the grammar independently of the query: you simply convert the raw text to a computer-friendly format, abstracting the parsing implementation away from its usage.

Answer 2

You have two options:

Use the .split() method of the string (as pointed out in the comments)
Use the re module for regular expressions.

I would suggest using the re module and creating a pattern with named groups.

Recipe:

first create a pattern with re.compile() containing named groups
do a for loop over the file to get the lines
use .match() of the created pattern object on each line
use .groupdict() of the returned match object to access your values of interest
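
The recipe above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the group names flag and name, the helper count_flags and the inline sample list are made up for the example, and the sign convention (OUT adds one, IN subtracts one) is taken from the question's wording.

```python
import re
from collections import defaultdict

# Named groups: "flag" is the word before the colon, "name" is the quoted text.
# Both group names are illustrative choices, not part of any existing API.
LINE_RE = re.compile(r'\d+:\d+:\d+ \(slbfd\) (?P<flag>\w+): "(?P<name>[^"]+)"')


def count_flags(lines):
    """Return a dict mapping each quoted name to its running count."""
    counts = defaultdict(int)
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # line does not have the expected shape
        fields = match.groupdict()
        if fields["flag"] == "OUT":
            counts[fields["name"]] += 1
        elif fields["flag"] == "IN":
            counts[fields["name"]] -= 1
        # any other flag (e.g. UNSUPPORTED) is ignored
    return dict(counts)


sample = [
    '7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1',
    '7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS   ) Albahraj@nabwmps3',
    '7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3',
]
print(count_flags(sample))
```

To run it on a real log, pass the open file object instead of the sample list, since iterating over a file yields its lines.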

Answer 3

If I assume that the file is divided into lines (I don't know if that's true), you have to apply the split() function to each line. You will have this:

['7:06:32', '(slbfd)', 'IN:', '"lq_viz_server"', 'aqeela@nabltas1']

And then I think you can apply whatever logic you need by comparing those values.
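
As a rough sketch of that idea (the helper name parse_line is made up for illustration, and it assumes the quoted name never contains spaces, which holds for the sample lines but may not hold for your whole file):

```python
def parse_line(line):
    """Return (flag, name) from one log line, or None if the line is too short."""
    parts = line.split()  # split on any run of whitespace
    if len(parts) < 4:
        return None
    flag = parts[2].rstrip(":")  # "IN:" becomes "IN"
    name = parts[3].strip('"')   # the name still carries its double quotes after split()
    return flag, name


line = '7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1'
print(parse_line(line))
```

From there you can compare the flag against "IN" and "OUT" and adjust a counter keyed by the name.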

Answer 4

I made some wild assumptions about your specification, and here is some sample code to help you start:

objects = {}
with open("data.txt") as data:
    for line in data:
        if "IN:" in line or "OUT:" in line:
            try:
                name = line.split("\"")[1]
            except IndexError:
                print("No double quoted name on line: {}".format(line))
                name = "PARSING_ERRORS"
            if "OUT:" in line:
                diff = 1
            else:
                diff = -1
            try:
                objects[name] += diff
            except KeyError:
                objects[name] = diff
print(objects) # for debug only, not advisable to print huge number of names

Answer 5

In the mode of just get 'er done with the standard distribution, this works:

import re
from collections import Counter

count = Counter()
# "data.txt" is a placeholder; open your own log file here
with open("data.txt") as inF:
    for line in inF:
        match = re.match(r'\d+:\d+:\d+ \(slbfd\) (\w+): "(\w+)"', line)
        if match:
            if match.group(1) == 'IN':
                count[match.group(2)] += 1
            elif match.group(1) == 'OUT':
                count[match.group(2)] -= 1

print(count)

Prints:

Counter({'lq_viz_server': 1, 'OFM32': -1})

5 answers

Answer 1
5 2012-06-22 14:39:53

Answer 2
1 2012-06-22 14:29:00

Answer 3
1 2012-06-22 14:29:10

Answer 4
1 2012-06-22 14:31:49

Answer 5
0 2012-06-22 15:29:28
