Reformatting text in Python

I am new to Python and am having great difficulty parsing a log file. Can you please help me understand how I can accomplish the following in the most Pythonic way? The log file looks like this:

----- Log Entry 5 -----
Time       : 2016-07-12 09:00:00
Animal     : Brown Bear
Bird       : White Owl
Fish       : Salmon


----- Log Entry 6 -----
Time       : 2016-07-12 09:00:00
Animal     : Brown Bear
Bird       : Parrot
Fish       : Tuna


----- Log Entry 7 -----
Time       : 2016-07-12 09:00:00
Animal     : Lion
Bird       : White Owl
Fish       : Sword Fish


----- Log Entry 8 -----
Time       : 2016-07-12 09:15:00
Animal     : Lion
Bird       : White Owl
Fish       : Sword Fish

Desired Output 1: I would like to reformat the log to look like this:

Time: 2016-07-12 09:00:00 Animal: Brown Bear  Bird: White Owl  Fish : Salmon
Time: 2016-07-12 09:00:00 Animal: Brown Bear  Bird: Parrot     Fish : Tuna
Time: 2016-07-12 09:00:00 Animal: Lion        Bird: White Owl  Fish : Sword Fish
Time: 2016-07-12 09:15:00 Animal: Lion        Bird: White Owl  Fish : Sword Fish

Desired Output 2: I would then like to be able to query a timestamp and get a summary of counts:

Time: 2016-07-12 09:00:00
Name:       Count:
Brown Bear  2
Lion        1
White Owl   2
Parrot      1
Salmon      1
Tuna        1
Sword Fish  1

Time: 2016-07-12 09:15:00
Name:       Count:
Lion        1
White Owl   1
Sword Fish  1

My code so far:

import os, sys, time, re, collections, subprocess

show_cmd = 'cat question |  egrep -v \'^$|=|Log\' | awk \'ORS=NR%4?FS:RS\' | grep Time'
log = (subprocess.check_output(show_cmd, shell=True).decode('utf-8'))

def time_field():
    logRegex = re.compile(r'Time\s*:.*\d\d\d-\d\d-\d\d\s\d\d:\d\d')
    log_parsed = (logRegex.findall(log))
    a = (str(log_parsed).replace('  ', ''))
    a = ((' ' + a[1:-1]).split(','))
    for i in a:
        print(i)

time_field()

There are many ways to do this. Personally, I would avoid regular expressions here: they probably won't be more efficient, and the expression becomes cumbersome and inflexible. Here is what I came up with:

class Entry:
    def __init__(self):
        self.time = None
        self.animal = None
        self.bird = None
        self.fish = None

    def __repr__(self):
        fmt = "{0} {1} {2} {3}".format(
            "Time: {time: <{width}}",
            "Animal: {animal: <{width}}",
            "Bird: {bird: <{width}}",
            "Fish: {fish: <{width}}")
        return fmt.format(
            time=self.time, animal=self.animal,
            bird=self.bird, fish=self.fish,
            width=12)

    def __radd__(self, other):
        return self.__add__(other)

    def __add__(self, other):
        if type(other) == dict:
            for i in [self.animal, self.bird, self.fish]:
                if i in other: other[i] += 1
                else: other[i] = 1
            return other
        elif type(other) == Entry:
            return self.__add__({}) + other
        else:
            return self.__add__({})

def parse_log(path):
    def extract(line):
        start = line.find(':') + 1
        return line[start:].strip()

    entries = []
    entry = None
    with open(path, 'r') as f:
        for line in f.readlines():
            if line.startswith('-----'):
                if entry: entries.append(entry)
                entry = Entry()
            elif line.startswith('Time'):
                entry.time = extract(line)
            elif line.startswith('Animal'):
                entry.animal = extract(line)
            elif line.startswith('Bird'):
                entry.bird = extract(line)
            elif line.startswith('Fish'):
                entry.fish = extract(line)

        if entry: entries.append(entry)

    return entries


def print_output_1(entries):
    # print() with a single argument works in both Python 2 and Python 3
    for entry in entries:
        print(entry)

def print_output_2(entries, time):
    # Summing Entry objects yields a dict of name -> count (see __add__)
    animals = sum([e for e in entries if e.time == time])

    print("Time: {0}".format(time))
    print("Name:        Count:")
    for animal, count in animals.items():
        print("{animal: <{width}} {count}".format(
                animal=animal, count=count, width=12))


logPath = 'log.log'
time = '2016-07-12 09:15:00'
entries = parse_log(logPath)

print_output_1(entries)
print("")
print_output_2(entries, time)

The output (assuming log.log matches the input you gave) is:

Time: 2016-07-12 09:00:00 Animal: Brown Bear   Bird: White Owl    Fish: Salmon
Time: 2016-07-12 09:00:00 Animal: Brown Bear   Bird: Parrot       Fish: Tuna
Time: 2016-07-12 09:00:00 Animal: Lion         Bird: White Owl    Fish: Sword Fish
Time: 2016-07-12 09:15:00 Animal: Lion         Bird: White Owl    Fish: Sword Fish

Time: 2016-07-12 09:15:00
Name:        Count:
White Owl    1
Sword Fish   1
Lion         1

This code uses object-oriented programming to simplify the three tasks at hand: storing log entries, representing log entries in a specific format, and combining log entries according to a specific property.

First, note that an Entry object and its properties (self.time, self.animal, self.bird, self.fish) represent one entry in the log. Assuming the information stored in its properties is correct, a method can be created to represent that information as a formatted string. Python calls __repr__() when it needs a string representation of an object (print falls back to it when no __str__() is defined), so it seemed like a good place to put this code. The method makes heavy use of the format function, but how it works should be clear after browsing the Python documentation on format.
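As a minimal sketch of the nested replacement field that __repr__() relies on, the snippet below pads a labelled value to a fixed width. The helper name pad_field is hypothetical and not part of the code above; it only isolates the "{value: <{width}}" idea.

```python
def pad_field(label, value, width=12):
    """Hypothetical helper: left-align value in a column of the given width."""
    # "{value: <{width}}" means: fill with spaces, left-align, pad to `width`.
    # The inner {width} is substituted first, giving e.g. "{value: <12}".
    return "{label}: {value: <{width}}".format(
        label=label, value=value, width=width)

line = " ".join([
    pad_field("Animal", "Lion"),
    pad_field("Bird", "White Owl"),
])
print(repr(line))

# Values longer than the width are not truncated, only left unpadded.
print(repr(pad_field("Time", "2016-07-12 09:00:00")))
```

This is also why the Time column in the output is not padded: the timestamp is already longer than 12 characters.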

A method for combining these Entry objects is needed in order to get the second output you specified. This can be done in many ways, and the way I did it is not necessarily the best. I used the __radd__() and __add__() methods, which are called when the + operator is used on an object. By doing this, the code entry1 + entry2 or sum([entry1, entry2]) can be used to get the sum of the animals in both entries. The Entry class cannot be used to store the result of the sum, however, because it cannot contain arbitrary information. Instead, I chose to use a dict object as the result of summing two Entry objects. In order to sum more than two Entry objects, Entry must also be able to sum with a dict object, because Entry + Entry + Entry results in dict + Entry.

The __add__() function checks whether the object it is being added to is a dict. If so, it checks whether each of the animals in the entry already exists in the dict. If not, it adds the animal as a key with a count of 1; otherwise, it increments the value of that key. __radd__() is similar to __add__() except that Python calls it when the Entry is the right-hand operand and the left-hand operand cannot handle the addition, as happens with dict + Entry or with the initial 0 + Entry inside sum(). See the Python documentation for more information.

For the case where the other object is an Entry, code could have been written to gather all of the animals from each Entry object and build a dict from that information. But since there is already code to add an Entry to a dict, it is easier to first add one object to an empty dict and then add the resulting dict to the other Entry object.

For all other objects, the Entry simply returns the dict description of itself, that is, itself added to an empty dict.
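To make the role of __radd__() concrete, here is a stripped-down sketch. The class Tally is a hypothetical stand-in for Entry that holds a single name; it is meant only to show which method Python ends up calling at each step of sum(), and it assumes Python 3.7+ for the dict ordering shown in the comment.

```python
class Tally:
    """Hypothetical stand-in for Entry that carries a single name."""

    def __init__(self, name):
        self.name = name

    def __radd__(self, left):
        # Called for `left + self` when `left` cannot handle the addition:
        #   0 + Tally     (first step of sum(), which starts from 0)
        #   dict + Tally  (every later step, since dict defines no __add__)
        counts = dict(left) if isinstance(left, dict) else {}
        counts[self.name] = counts.get(self.name, 0) + 1
        return counts

total = sum([Tally("Lion"), Tally("White Owl"), Tally("Lion")])
print(total)  # {'Lion': 2, 'White Owl': 1}
```

Note that sum() of an empty list returns the integer 0 rather than a dict, so code built this way needs to handle the case where nothing matches.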

Now all of the tools exist to accomplish the goals listed earlier. To get a string representation of an Entry that matches desired output 1, all that is needed is print(entry) or strrepr = str(entry). Desired output 2 takes a little more work, but it is simply a matter of summing all entries that have the same self.time property and then displaying the resulting dict.
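As an alternative to overloading +, the same per-timestamp summary could be built with collections.Counter from the standard library. This is a different technique from the one used above, shown only as a sketch: plain dicts stand in for Entry objects, and the names records and summarize are assumptions made for illustration.

```python
from collections import Counter

# Sample records mirroring the log in the question (dicts stand in for Entry).
records = [
    {"time": "2016-07-12 09:00:00", "animal": "Brown Bear", "bird": "White Owl", "fish": "Salmon"},
    {"time": "2016-07-12 09:00:00", "animal": "Brown Bear", "bird": "Parrot", "fish": "Tuna"},
    {"time": "2016-07-12 09:00:00", "animal": "Lion", "bird": "White Owl", "fish": "Sword Fish"},
    {"time": "2016-07-12 09:15:00", "animal": "Lion", "bird": "White Owl", "fish": "Sword Fish"},
]

def summarize(records, time):
    """Count every animal, bird and fish name seen at the given timestamp."""
    counts = Counter()
    for record in records:
        if record["time"] == time:
            counts.update([record["animal"], record["bird"], record["fish"]])
    return counts

query = "2016-07-12 09:00:00"
summary = summarize(records, query)
print("Time: {0}".format(query))
print("{0: <12}{1}".format("Name:", "Count:"))
for name, count in summary.items():
    print("{0: <12}{1}".format(name, count))
```

A timestamp with no matching entries yields an empty Counter here, so the printing loop simply produces no rows.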

The last part of the code not yet covered is the parsing of the log to create a list of Entry objects. The code simply walks through the log line by line and populates an Entry with the information. I feel this is pretty straightforward, but feel free to ask questions if it does not make sense.
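The line-by-line walk can also be sketched without hard-coding each field name. The variant below is an assumption-laden illustration, not the code above: parse_lines is a hypothetical generator that yields one dict per entry and splits each line at its first colon with str.partition, so the colons inside the timestamp are preserved. It reads from a string here so that it runs without a log file.

```python
SAMPLE = """----- Log Entry 5 -----
Time       : 2016-07-12 09:00:00
Animal     : Brown Bear
Bird       : White Owl
Fish       : Salmon


----- Log Entry 6 -----
Time       : 2016-07-12 09:00:00
Animal     : Brown Bear
Bird       : Parrot
Fish       : Tuna
"""

def parse_lines(lines):
    """Yield one dict per log entry, keyed by the lower-cased field name."""
    entry = None
    for line in lines:
        if line.startswith("-----"):
            # A header line closes the previous entry and opens a new one.
            if entry:
                yield entry
            entry = {}
        elif ":" in line and entry is not None:
            # partition() splits at the first colon only.
            key, _, value = line.partition(":")
            entry[key.strip().lower()] = value.strip()
    if entry:
        yield entry

parsed = list(parse_lines(SAMPLE.splitlines()))
for item in parsed:
    print(item)
```

To read from a file instead, an open file object can be passed in place of SAMPLE.splitlines(), since iterating over a file also yields one line at a time.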
